Methods of Identifying Patterns in Biological Systems and Uses Thereof

Technical Field 

The present invention relates to the use of learning machines to identify
relevant patterns in biological systems such as genes, gene products, proteins,
lipids, and combinations of the same. These patterns in biological systems can be
used to diagnose and prognose abnormal physiological states. In addition, the
patterns that can be detected using the present invention can be used to develop
therapeutic agents.

Background of the Invention

Enormous amounts of data about organisms are being generated in the
sequencing of genomes. Using this information to provide treatments and
therapies will require an in-depth understanding of the gathered
information. Efforts using genomic information have already led to the
development of gene expression investigational devices. One of the most
currently promising devices is the gene chip. Gene chips have arrays of
oligonucleotide probes attached to a solid base structure. Such devices are
described in U.S. Patent Nos. 5,837,832 and 5,143,854, herein incorporated by
reference in their entirety. The oligonucleotide probes present on the chip can be
used to determine whether a target nucleic acid has a nucleotide sequence
identical to or different from a specific reference sequence. The array of probes
comprises probes that are complementary to the reference sequence as well as
probes that differ by one or more bases from the complementary probes.

The gene chips are capable of containing large arrays of oligonucleotides
on very small chips. A variety of methods for measuring hybridization intensity
data to determine which probes are hybridizing is known in the art. Methods for
detecting hybridization include fluorescent, radioactive, enzymatic,
chemoluminescent, bioluminescent and other detection systems.

Older, but still usable, methods such as gel electrophoresis and
hybridization to gel blots or dot blots are also useful for determining genetic
sequence information. Capture and detection systems for solution hybridization
and in situ hybridization methods are also used for determining information about
a genome. Additionally, former and currently used methods for defining large
parts of genomic sequences, such as chromosome walking and phage library
establishment, are used to gain knowledge about genomes.

Large amounts of information regarding the sequence, regulation,
activation, binding sites and internal coding signals can be generated by the
methods known in the art. In fact, the amount of data being generated by such
methods hinders the derivation of useful information. Human researchers, when
aided by advanced learning tools such as neural networks, can only derive crude
models of the underlying processes represented in the large, feature-rich datasets.

Another area of biological investigation that can generate a huge amount
of data is the field of proteomics. Proteomics is the study of the group
of proteins encoded and regulated by a genome. This field represents a new focus
on analyzing proteins, regulation of protein levels and the relationship to gene
regulation and expression. Understanding the normal or pathological state of the
proteome of a person or a population provides information for the prognosis or
diagnosis of disease, development of drug or genetic treatments, or enzyme
replacement therapies. Current methods of studying the proteome involve
2-dimensional (2-D) gel electrophoresis of the proteins followed by analysis by
mass spectrophotometry. A pattern of proteins at any particular time or stage in
pathogenesis or treatment can be observed by 2-D gel electrophoresis.
Problems arise in identifying the thousands of proteins that are found in cells that
have been separated on the 2-D gels. The mass spectrophotometer is used to
identify a protein isolated from the gel by identifying the amino acid sequence
and comparing it to known sequence databases. Unfortunately, these methods
require multiple steps to analyze a small portion of the proteome.

In recent years, technologies have been developed that can relate gene
expression to protein production, structure and function. Automated high-
throughput analysis, nucleic acid analysis and bioinformatics technologies have
aided in the ability to probe genomes and to link gene mutations and expression
with disease predisposition and progression. The current analytical methods are
limited in their abilities to manage the large amounts of data generated by these
technologies.

One of the more recent advances in determining the functioning
parameters of biological systems is the analysis of correlation of genomic
information with protein functioning to elucidate the relationship between gene
expression, protein function and interaction, and disease states or progression.
Genomic activation or expression does not always mean direct changes in protein
production levels or activity. Alternative processing of mRNA or post-
transcriptional or post-translational regulatory mechanisms may cause the activity
of one gene to result in multiple proteins, all of which are slightly different with
different migration patterns and biological activities. The human genome
potentially contains 30,000 genes but the human proteome is believed to be 50 to
100 times larger. Currently, there are no methods, systems or devices for
adequately analyzing the data generated by such biological investigations into the
genome and proteome.

Knowledge discovery is the most desirable end product of data collection.
Recent advancements in database technology have led to an explosive growth in
systems and methods for generating, collecting and storing vast amounts of data.
While database technology enables efficient collection and storage of large data
sets, the challenge of facilitating human comprehension of the information in this
data is growing ever more difficult. With many existing techniques the problem
has become unapproachable. Thus, there remains a need for a new generation of
automated knowledge discovery tools.

As a specific example, the Human Genome Project is populating a multi-
gigabyte database describing the human genetic code. Before this mapping of
the human genome is complete, the size of the database is expected to grow
significantly. The vast amount of data in such a database overwhelms traditional
tools for data analysis, such as spreadsheets and ad hoc queries. Traditional
methods of data analysis may be used to create informative reports from data, but
do not have the ability to intelligently and automatically assist humans in
analyzing and finding patterns of useful knowledge in vast amounts of data.
Likewise, using traditionally accepted reference ranges and standards for
interpretation, it is often impossible for humans to identify patterns of useful
knowledge even with very small amounts of data.

In recent years, machine-learning approaches for data analysis have been
widely explored for recognizing patterns which, in turn, allow extraction of
significant information contained within a large data set which may also include
data that provide nothing more than irrelevant detail. Learning machines
comprise algorithms that may be trained to generalize using data with known
outcomes. Trained learning machine algorithms may then be applied to predict
the outcome in cases of unknown outcome. Machine-learning approaches, which
include neural networks, hidden Markov models, belief networks and support
vector machines, are ideally suited for domains characterized by the existence of
large amounts of data, noisy patterns and the absence of general theories.

The majority of learning machines that have been investigated are neural
networks trained using back-propagation, a gradient-based method in which
errors in classification of training data are propagated backwards through the
network to adjust the bias weights of the network elements until the mean
squared error is minimized. A significant drawback of back-propagation neural
networks is that the empirical risk function may have many local minima, a
case that can easily obscure the optimal solution from discovery. Standard
optimization procedures employed by back-propagation neural networks may
converge to a minimum, but the neural network method cannot guarantee that
even a localized minimum is attained, much less the desired global minimum.
The quality of the solution obtained from a neural network depends on many
factors. In particular, the skill of the practitioner implementing the neural
network determines the ultimate benefit, but even factors as seemingly benign as
the random selection of initial weights can lead to poor results. Furthermore, the
convergence of the gradient-based method used in neural network learning is
inherently slow. A further drawback is that the sigmoid activation function has a
scaling factor, which affects the quality of approximation. Possibly the largest
limiting factor of neural networks as related to knowledge discovery is the "curse
of dimensionality" associated with the disproportionate growth in required
computational time and power for each additional feature or dimension in the
training data.

The shortcomings of neural networks are overcome using support vector
machines. In general terms, a support vector machine maps input vectors into
high dimensional feature space through a non-linear mapping function, chosen a
priori. In this high dimensional feature space, an optimal separating hyperplane is
constructed. The optimal hyperplane is then used to determine things such as
class separations, regression fit, or accuracy in density estimation.

Within a support vector machine, the dimensionality of the feature space
may be huge. For example, a fourth degree polynomial mapping function causes
a 200 dimensional input space to be mapped into a 1.6 billion dimensional
feature space. The kernel trick and the Vapnik-Chervonenkis ("VC") dimension
allow the support vector machine to thwart the "curse of dimensionality" limiting
other methods and effectively derive generalizable answers from this very high
dimensional feature space.
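
The arithmetic behind the 1.6 billion figure, and the reason the kernel trick
avoids ever building such a space, can be sketched as follows. This is a minimal
illustration, assuming the mapping in question is the homogeneous fourth degree
polynomial map whose coordinates are all ordered four-fold products of the input
coordinates (giving 200^4 dimensions). The helper names `explicit_features` and
`poly_kernel` are hypothetical and do not come from the specification.

```python
from itertools import product


def explicit_features(x, degree):
    """Map x to all ordered degree-fold products of its coordinates."""
    feats = []
    for idx in product(range(len(x)), repeat=degree):
        value = 1.0
        for i in idx:
            value *= x[i]
        feats.append(value)
    return feats


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, y, degree):
    """Homogeneous polynomial kernel: (x . y) ** degree."""
    return dot(x, y) ** degree


# Size of the explicit feature space for 200 inputs and a degree-4 map.
feature_space_dim = 200 ** 4  # 1,600,000,000

# On a small 3-dimensional example, the kernel value equals the dot product
# of the explicit feature vectors, without those vectors ever being built.
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
explicit = dot(explicit_features(x, 4), explicit_features(y, 4))
implicit = poly_kernel(x, y, 4)
```

The explicit route touches 3^4 = 81 coordinates here and would touch 1.6 billion
for 200 inputs, whereas the kernel needs only one 200-term dot product and a
fourth power.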

If the training vectors are separated by the optimal hyperplane (or
generalized optimal hyperplane), then the expectation value of the probability of
committing an error on a test example is bounded by the examples in the training
set. This bound depends neither on the dimensionality of the feature space, nor
on the norm of the vector of coefficients, nor on the bound of the number of the
input vectors. Therefore, if the optimal hyperplane can be constructed from a
small number of support vectors relative to the training set size, the
generalization ability will be high, even in infinite dimensional space. Support
vector machines are disclosed in U.S. Patent Nos. 6,128,608 and 6,157,921, both
of which are assigned to the assignee of the present application and are
incorporated herein by reference.

The data generated from genomic and proteomic tests can be analyzed
from many different viewpoints. The literature shows simple approaches such as
studies of gene clusters discovered by unsupervised learning techniques (Alon,
1999). For example, each experiment may correspond to one patient carrying or
not carrying a specific disease (see e.g. (Golub, 1999)). In this case, clustering
usually groups patients with similar clinical records. Supervised learning has also
been applied to the classification of proteins (Brown, 2000) and to cancer
classification (Golub, 1999).

Support vector machines provide a desirable solution for the problem of
discovering knowledge from vast amounts of input data. However, the ability of
a support vector machine to discover knowledge from a data set is limited in
proportion to the information included within the training data set. Accordingly,
there exists a need for a system and method for pre-processing data so as to
augment the training data to maximize the knowledge discovery by the support
vector machine.

Furthermore, the raw output from a support vector machine may not fully
disclose the knowledge in the most readily interpretable form. Thus, there further
remains a need for a system and method for post-processing data output from a
support vector machine in order to maximize the value of the information
delivered for human or further automated processing.

In addition, the ability of a support vector machine to discover knowledge
from data is limited by the type of kernel selected. Accordingly, there remains a
need for an improved system and method for selecting and/or creating an
appropriate kernel for a support vector machine.

Further, methods, systems and devices are needed to manipulate the
information contained in the databases generated by investigations of proteomics
and genomics. Also, methods, systems and devices are needed to integrate
information from genomic, proteomic and traditional sources of biological
information. Such information is needed for the diagnosis and prognosis of
diseases and other changes in biological and other systems.

Furthermore, methods and compositions are needed for treating the
diseases and other changes in biological systems that are identified by the support
vector machine. Once patterns or the relationships between the data are identified
by the support vector machines of the present invention and are used to detect or
diagnose a particular disease state, diagnostic tests, including gene chips and
tests of bodily fluids or bodily changes, and methods and compositions for
treating the condition are needed.

Summary of the Invention

The present invention comprises systems and methods for enhancing
knowledge discovered from data using a learning machine in general and a
support vector machine in particular. In particular, the present invention
comprises methods of using a learning machine for diagnosing and prognosing
changes in biological systems such as diseases. Further, once the knowledge
discovered from the data is determined, the specific relationships discovered are
used to diagnose and prognose diseases, and methods of detecting and treating
such diseases are applied to the biological system. In particular, the invention is
directed to detection of genes involved with prostate cancer and determining
methods and compositions for treatment of prostate cancer.

One embodiment of the present invention comprises preprocessing a
training data set in order to allow the most advantageous application of the
learning machine. Each training data point comprises a vector having one or
more coordinates. Pre-processing the training data set may comprise identifying
missing or erroneous data points and taking appropriate steps to correct the
flawed data or, as appropriate, remove the observation or the entire field from the
scope of the problem. Pre-processing the training data set may also comprise
adding dimensionality to each training data point by adding one or more new
coordinates to the vector. The new coordinates added to the vector may be
derived by applying a transformation to one or more of the original coordinates.
The transformation may be based on expert knowledge, or may be
computationally derived. In a situation where the training data set comprises a
continuous variable, the transformation may comprise optimally categorizing the
continuous variable of the training data set.
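
Adding new coordinates to each training vector can be sketched as below. This is
a minimal illustration only: the two transformations (a product of two
coordinates and a thresholded copy of a continuous value), the threshold of 5
and the function names are all hypothetical stand-ins for whatever expert
knowledge or computation would supply in practice.

```python
def expand_point(point, transforms):
    """Return the original coordinates followed by the derived ones."""
    derived = [transform(point) for transform in transforms]
    return list(point) + derived


def expand_data_set(points, transforms):
    """Apply the same expansion to every data point in the set."""
    return [expand_point(p, transforms) for p in points]


# Hypothetical transformations standing in for expert knowledge.
transforms = [
    lambda p: p[0] * p[1],           # interaction between two coordinates
    lambda p: 1 if p[0] > 5 else 0,  # categorized copy of a continuous value
]

training_points = [[2.0, 3.0], [7.5, 1.0]]
expanded = expand_data_set(training_points, transforms)
```

Because the expansion is a plain function of the data point, the same
`transforms` list can be reused unchanged on test data and live data.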

In a preferred embodiment, the support vector machine is trained using
the pre-processed training data set. In this manner, the additional representations
of the training data provided by the preprocessing may enhance the learning
machine's ability to discover knowledge therefrom. In the particular context of
support vector machines, the greater the dimensionality of the training set, the
higher the quality of the generalizations that may be derived therefrom. When the
knowledge to be discovered from the data relates to a regression or density
estimation or where the training output comprises a continuous variable, the
training output may be post-processed by optimally categorizing the training
output to derive categorizations from the continuous variable.

Some of the original coordinates may be noisy or irrelevant to the problem
and therefore more harmful than useful. In a preferred embodiment of the
invention, the pre-processing also consists of removing coordinates that are
irrelevant to the problem at hand by using a filter technique. Such filter
techniques are known to those skilled in the art and may include correlation
coefficients and the selection of the first few principal components.
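
One way a correlation-coefficient filter could look is sketched below. It is an
assumption-laden illustration, not the specification's procedure: it scores each
coordinate by the absolute Pearson correlation with the class label and keeps
the top-scoring ones. The function names, the `keep` parameter and the toy data
are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_coordinates(points, labels, keep):
    """Keep the `keep` coordinates with the largest |correlation| to the labels."""
    n_coords = len(points[0])
    scores = []
    for j in range(n_coords):
        column = [p[j] for p in points]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties broken by the lower coordinate index.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    kept = sorted(j for _, j in ranked[:keep])
    return kept, [[p[j] for j in kept] for p in points]


# Toy data: coordinate 0 tracks the label, coordinate 1 is constant,
# and coordinate 2 is unrelated to the label.
points = [[0.9, 1.0, 0.3], [1.1, 1.0, 0.7], [-1.0, 1.0, 0.7], [-0.8, 1.0, 0.3]]
labels = [1, 1, -1, -1]
kept, reduced = filter_coordinates(points, labels, keep=1)
```

A filter of this kind scores each coordinate in isolation, so it is cheap to
compute but cannot detect coordinates that are only informative in combination.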

A test data set is pre-processed in the same manner as was the training
data set. Then, the trained learning machine is tested using the pre-processed test
data set. A test output of the trained learning machine may be post-processed to
determine if the test output is an optimal solution. Post-processing the test output
may comprise interpreting the test output into a format that may be compared
with the test data set. Alternative post-processing steps may enhance the human
interpretability or suitability for additional processing of the output data.

In the context of a support vector machine, the present invention also
provides for the selection of at least one kernel prior to training the support vector
machine. The selection of a kernel may be based on prior knowledge of the
specific problem being addressed or analysis of the properties of any available
data to be used with the learning machine and is typically dependent on the nature
of the knowledge to be discovered from the data. Optionally, an iterative process
comparing postprocessed training outputs or test outputs can be applied to make a
determination as to which configuration provides the optimal solution. If the test
output is not the optimal solution, the selection of the kernel may be adjusted and
the support vector machine may be retrained and retested. When it is determined
that the optimal solution has been identified, a live data set may be collected and
pre-processed in the same manner as was the training data set. The pre-processed
live data set is input into the learning machine for processing. The live output of
the learning machine may then be post-processed by interpreting the live output
into a computationally derived alphanumeric classifier or other form suitable to
further utilization of the SVM derived answer.
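
The train, test and compare loop over candidate kernels can be sketched as
follows. To stay self-contained, the sketch swaps in a kernel perceptron as a
simple stand-in for support vector machine training; it is not an SVM solver.
The two candidate kernels, the XOR-style toy data and every function name are
hypothetical.

```python
def linear_kernel(x, y):
    """Linear kernel with a constant term acting as a bias."""
    return sum(a * b for a, b in zip(x, y)) + 1.0


def poly2_kernel(x, y):
    """Second degree polynomial kernel."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** 2


def decision(kernel, points, labels, alphas, x):
    """Kernel expansion of the decision value at x."""
    return sum(a * y * kernel(p, x) for a, y, p in zip(alphas, labels, points))


def train(kernel, points, labels, epochs=20):
    """Kernel perceptron (stand-in for SVM training): mistake counts per point."""
    alphas = [0] * len(points)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * decision(kernel, points, labels, alphas, x) <= 0:
                alphas[i] += 1
    return alphas


def accuracy(kernel, points, labels, alphas, test_points, test_labels):
    """Fraction of test points whose predicted sign matches the known label."""
    hits = 0
    for x, y in zip(test_points, test_labels):
        predicted = 1 if decision(kernel, points, labels, alphas, x) > 0 else -1
        hits += predicted == y
    return hits / len(test_points)


def select_kernel(kernels, train_set, test_set):
    """Train and test one machine per candidate kernel; return the best one."""
    scores = {}
    for name, kernel in kernels.items():
        alphas = train(kernel, *train_set)
        scores[name] = accuracy(kernel, *train_set, alphas, *test_set)
    best = max(scores, key=scores.get)
    return best, scores


# XOR-style toy data: no straight line separates the two classes.
train_set = ([[-1, -1], [1, 1], [-1, 1], [1, -1]], [-1, -1, 1, 1])
test_set = ([[-0.9, -0.8], [0.8, 0.9], [-0.8, 0.9], [0.9, -0.8]], [-1, -1, 1, 1])
kernels = {"linear": linear_kernel, "poly2": poly2_kernel}
best, scores = select_kernel(kernels, train_set, test_set)
```

On this data the linear candidate cannot classify all four test points, while
the second degree polynomial candidate can, so the loop selects `poly2`. In
practice the comparison would be made on post-processed outputs and repeated
whenever the kernel selection is adjusted.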

In an exemplary embodiment, a system is provided for enhancing knowledge
discovered from data using a support vector machine. The exemplary system
comprises a storage device for storing a training data set and a test data set, and a
processor for executing a support vector machine. The processor is also operable
for collecting the training data set from the database, pre-processing the training
data set to enhance each of a plurality of training data points, training the support
vector machine using the pre-processed training data set, collecting the test data
set from the database, pre-processing the test data set in the same manner as was
the training data set, testing the trained support vector machine using the pre-
processed test data set, and in response to receiving the test output of the trained
support vector machine, post-processing the test output to determine if the test
output is an optimal solution. The exemplary system may also comprise a
communications device for receiving the test data set and the training data set
from a remote source. In such a case, the processor may be operable to store the
training data set in the storage device prior to pre-processing of the training data set
and to store the test data set in the storage device prior to pre-processing of the test
data set. The exemplary system may also comprise a display device for
displaying the post-processed test data. The processor of the exemplary system
may further be operable for performing each additional function described above.
The communications device may be further operable to send a computationally
derived alphanumeric classifier or other SVM-based raw or post-processed output
data to a remote source.

In an exemplary embodiment, a system and method are provided for
enhancing knowledge discovery from data using multiple learning machines in
general and multiple support vector machines in particular. Training data for a
learning machine is pre-processed in order to add meaning thereto. Pre-
processing data may involve transforming the data points and/or expanding the
data points. By adding meaning to the data, the learning machine is provided
with a greater amount of information for processing. With regard to support
vector machines in particular, the greater the amount of information that is
processed, the better generalizations about the data that may be derived. Multiple
support vector machines, each comprising distinct kernels, are trained with the
pre-processed training data and are tested with test data that is pre-processed in
the same manner. The test outputs from multiple support vector machines are
compared in order to determine which of the test outputs, if any, represents an
optimal solution. Selection of one or more kernels may be adjusted and one or
more support vector machines may be retrained and retested. When it is
determined that an optimal solution has been achieved, live data is pre-processed
and input into the support vector machine comprising the kernel that produced the
optimal solution. The live output from the learning machine may then be post-
processed into a computationally derived alphanumeric classifier for
interpretation by a human or computer automated process.

A preferred embodiment comprises methods and systems for detecting
genes involved with prostate cancer and determination of methods and
compositions for treatments of prostate cancer. The present invention,
comprising supervised learning techniques, can use all the arrays currently
known. Several single genes separate all 17 BPH vs. 24 G4 without error. Using
the methods disclosed herein, one BPH was identified automatically as an outlier.
SVMs can reach zero leave-one-out error with at least as few as two genes. In the
space of the two genes selected by SVMs, normal samples and dysplasia
resemble BPH and G3 constitutes a separate cluster from BPH and G4.

Brief Description of the Drawings

FIG. 1 is a flowchart illustrating an exemplary general method for
increasing knowledge that may be discovered from data using a learning machine.

FIG. 2 is a flowchart illustrating an exemplary method for increasing
knowledge that may be discovered from data using a support vector machine.

FIG. 3 is a flowchart illustrating an exemplary optimal categorization
method that may be used in a stand-alone configuration or in conjunction with a
learning machine for pre-processing or post-processing techniques in accordance
with an exemplary embodiment of the present invention.

FIG. 4 illustrates an exemplary unexpanded data set that may be input into
a support vector machine.

FIG. 5 illustrates an exemplary expanded data set that may be input into a
support vector machine based on the data set of FIG. 4.

FIG. 6 illustrates an exemplary data set for stand-alone application of the
optimal categorization method of FIG. 3.

FIG. 7 is a functional block diagram illustrating an exemplary operating
environment for an embodiment of the present invention.

FIG. 8 is a functional block diagram illustrating a hierarchical system of
multiple support vector machines.






FIG. 10 illustrates an observation graph used to generate the binary tree of
FIG. 9.

FIG. 11: A) Separation of the training examples with an SVM. B) Separation
of the training and test examples with the same SVM. C) Separation of the
training examples with the baseline method. D) Separation of the training and
test examples with the baseline method.

FIG. 12 shows graphs of the results of using RFE.

FIG. 13 shows the distribution of gene expression values across tissue
samples for two genes.

FIG. 14 shows the distribution of gene expression values across genes for
all tissue samples.

FIG. 15 shows the results of RFE after preprocessing.

FIG. 16 shows a graphical comparison with the present invention and the
methods of Golub.

FIG. 17 shows the results of RFE when training on 100 dense QT_clust
clusters.

FIG. 18 shows the results of SVM-RFE when training on the entire data
set.

FIG. 19 shows the results of Golub's method when training on the entire
data set.

FIG. 20 shows a comparison of feature (gene) selection methods for colon
cancer data using different methods.

FIG. 21 shows the selection of an optimum number of genes for colon
cancer data.

FIG. 22 shows the metrics of classifier quality. The triangle and square
curves represent example distributions of two classes: class 1 (negative class) and
class 2 (positive class).

FIG. 23 shows the performance comparison between SVMs and the
baseline method for leukemia data.

FIGS. 24A-24D show the best set of 16 genes for the leukemia data.

FIG. 25 shows the selection of an optimum number of genes for leukemia
data.

FIG. 26 is a plot showing the results based on LCM data preparation for
prostate cancer analysis.

FIG. 27 is a plot graphically comparing SVM-RFE of the present
invention with leave-one-out classifier for prostate cancer.

FIG. 28 graphically compares the Golub and SVM methods for prostate
cancer.

FIG. 29 illustrates the decision functions obtained for Golub (a) and SVM
methods (b) for the two best ranking genes.

FIG. 30 is a plot of the learning curve for a SVM-RFE with varied gene
numbers.

FIG. 31 is a plot of the learning curve with varied gene number using the
Golub method.

Detailed Description

The present invention provides methods, systems and devices for
discovering knowledge from data using learning machines. Particularly, the
present invention is directed to methods, systems and devices for knowledge
discovery from data using learning machines that are provided information
regarding changes in biological systems. More particularly, the present invention
comprises methods of use of such knowledge for diagnosing and prognosing
changes in biological systems such as diseases. Additionally, the present
invention comprises methods, compositions and devices for applying such
knowledge to the testing and treating of individuals with changes in their
individual biological systems. Preferred embodiments comprise detection of
genes involved with prostate cancer and use of such information for treatment of
patients with prostate cancer.

As used herein, "biological data" means any data derived from
measuring biological conditions of humans, animals or other biological organisms
including microorganisms, viruses, plants and other living organisms. The
measurements may be made by any tests, assays or observations that are known
to physicians, scientists, diagnosticians, or the like. Biological data may include,
but is not limited to, clinical tests and observations, physical and chemical
measurements, genomic determinations, proteomic determinations, drug levels,
hormonal and immunological tests, neurochemical or neurophysical
measurements, mineral and vitamin level determinations, genetic and familial
histories, and other determinations that may give insight into the state of the
individual or individuals that are undergoing testing. Herein, the term "data" is
used interchangeably with "biological data".

While several examples of learning machines exist and advancements are
expected in this field, the exemplary embodiments of the present invention focus
on the support vector machine. As is known in the art, learning machines
comprise algorithms that may be trained to generalize using data with known
outcomes. Trained learning machine algorithms may then be applied to cases of
unknown outcome for prediction. For example, a learning machine may be
trained to recognize patterns in data, estimate regression in data or estimate
probability density within data. Learning machines may be trained to solve a
wide variety of problems as known to those of ordinary skill in the art. A trained
learning machine may optionally be tested using test data to ensure that its output
is validated within an acceptable margin of error. Once a learning machine is
trained and tested, live data may be input therein. The live output of a learning
machine comprises knowledge discovered from all of the training data as applied
to the live data.
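
The train, test and live-data sequence described above can be sketched as
follows. The sketch uses a nearest-centroid classifier purely as a simple
stand-in for a learning machine; it is not a support vector machine. The toy
measurements, the outcome names and the `ACCEPTABLE_ERROR` threshold are
hypothetical.

```python
def train(points, labels):
    """Fit a nearest-centroid model: one mean vector per known outcome."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for j, value in enumerate(point):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, point):
    """Return the outcome whose centroid is closest to the point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, point))
    return min(model, key=lambda label: squared_distance(model[label]))


def error_rate(model, points, labels):
    """Fraction of points whose predicted outcome differs from the known one."""
    wrong = sum(predict(model, p) != y for p, y in zip(points, labels))
    return wrong / len(points)


# Training data with known outcomes (toy values).
train_points = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_labels = ["normal", "normal", "abnormal", "abnormal"]
model = train(train_points, train_labels)

# Test data with known outcomes validates the trained machine.
test_points = [[1.1, 0.9], [2.9, 3.3]]
test_labels = ["normal", "abnormal"]
ACCEPTABLE_ERROR = 0.1  # hypothetical margin of error
validated = error_rate(model, test_points, test_labels) <= ACCEPTABLE_ERROR

# Only a validated machine is applied to live data of unknown outcome.
live_output = predict(model, [3.1, 3.0]) if validated else None
```

Keeping the test set separate from the training set is what makes the error
estimate meaningful for the live data that follows.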

The present invention comprises methods, systems and devices for
analyzing patterns found in biological data, such as data generated by
examination of genes, transcriptional and translational products and proteins.
Genomic information can be found in patterns generated by hybridization
reactions of genomic fragments and complementary nucleic acids or interacting
proteins. One of the most recent tools for investigating such genomic or nucleic
acid interactions is the DNA gene chip or microarray. The microarray allows for
the processing of thousands of nucleic interactions. DNA microarrays enable
researchers to screen thousands of genes in one experiment. For example, the
microarray could contain 2400 genes on a small glass slide and can be used to
determine the presence of DNA or RNA in the sample. Such microarray tests can
be used in basic research and biomedical research including tumor biology,
neurosciences, signal transduction, transcription regulation, and cytokine and
receptor studies. Additionally, there are applications for pharmaceutical drug
discovery, target identification, lead optimization, pharmacokinetics,
pharmacogenomics and diagnostics. The market for microarray technology was
approximately $98 million in 1999 and the amount of data generated and stored
in databases developed from multiple microarray tests is enormous. The present
invention provides for methods, systems and devices that can use the data
generated in such microarray and nucleic acid chip tests for the diagnosis and
prognosis of diseases and for the development of therapeutic agents to treat such
diseases.

The present invention also comprises devices comprising microarrays
with specific sequence identifying probes that can be used to diagnose or
prognose the status of or a specific change in the biological system. Once the
learning machine of the present invention has identified specific relationships
among the data that are capable of diagnosing or prognosing the current status or
a change in a biological system, specific devices then incorporate tests for those
specific relationships. For example, the learning machine of the present
invention identifies specific genes that are related to the presence or future
occurrence of a change in a biological system, such as the presence or appearance
of a tumor. Knowing the sequence of these genes allows for the making of a
specific treating device for those identified genes. For example, a support device,
such as a chip, comprising DNA, RNA or specific binding proteins, or any such
combination, that specifically binds to specifically identified genes is used to
easily identify individuals having a particular tumor or the likelihood of
developing the tumor. Additionally, specific proteins, either identified by
the learning machine or those that are associated with the genes identified by the
learning machine, can be determined, for example by using serological tests
directed to specifically detecting the identified proteins or gene products, or by
using antibodies or antibody fragments directed to the proteins or gene products.
Such tests include, but are not limited to, antibody microarrays on chips, Western
blotting tests, ELISA, and other tests known in the art wherein binding between
specific binding partners is used for detection of one of the partners.

Furthermore, the present invention comprises methods and compositions
for treating the conditions currently existing in a biological organism or
conditions resulting from changes in biological systems or for treating the
biological system to alter the biological system to prevent or enhance specific
conditions. For example, if the diagnosis of an individual includes the detection
of a tumor, the individual can be treated with anti-tumor medications such as
chemotherapeutic compositions. If the diagnosis of an individual includes the
predisposition or prognosis of tumor development, the individual may be treated
prophylactically with chemotherapeutic compositions to prevent the occurrence
of the tumor. If specific genes are identified with the occurrence of tumors, the
individual may be treated with specific antisense or other gene therapy methods
to suppress the expression of such genes. Additionally, if specific genes or gene
products are identified with the occurrence of tumors, then specific compositions
that inhibit or functionally affect the genes or gene products are administered to
the individual. The instances described herein are merely exemplary and are not
to be construed as limiting the scope of the invention.

Proteomic investigations provide for methods of determining the proteins
involved in normal and pathological states. Current methods of determining the
proteome of a person or a population at any particular time or stage comprise the
use of gel electrophoresis to separate the proteins in a sample. Preferably, 2-D
gel electrophoresis is used to separate the proteins more completely.
[...] The
proteins may be labeled, for example, with fluorescent dyes, to aid in the
determination of the [...]. Patterns of
separated proteins can be analyzed using the learning machines of the present
invention. Capturing the gel image can be accomplished by image technology
methods known in the art such as densitometry, CCD camera and laser scanning
and storage phosphor instruments. Analysis of the gels reveals patterns in the
proteome that are important in the diagnosis and prognosis of pathological states and
shows changes in relation to therapeutic interventions.

Further steps of investigating proteomes involve isolation of proteins at
specific sites in the gels. Robotic systems for isolating specific sites are currently
available. Isolation is followed by determination of the sequence and, thus, the
identity of the proteins. Studying the proteome of individuals or a population
involves the generation, capture, analysis and integration of an enormous amount
of data. Automation is currently being used to help manage the physical
manipulations needed for the data generation. The learning machines of the
present invention are used to analyze the biological data generated and to provide
the information desired.

Additionally, using modifications of detection devices, such as chip
detection devices, large libraries of biological data can be generated. Methods for
generating libraries include technologies that use proteins covalently linked to
their mRNA to determine the proteins made, for example, as rarely translated
proteins. Such a technology comprises translating mRNA in vitro and covalently
attaching the translated protein to the mRNA. The sequence of the mRNA and
thus the protein is then determined using amplification methods such as PCR.
Libraries containing 10^[...] to 10^[...] members can be established from this data.
These libraries can be used to determine peptides that bind receptors, or antibody
libraries can be developed that contain antibodies that avidly bind their targets.

Libraries called protein domain libraries can be created from cellular
RNA where the entire proteins are not translated, but fragments are [...].
These libraries can be used to [...].

Other methods of investigating the proteome do not use gel
electrophoresis. For example, mass spectrophotometry can be used to catalog
changes in protein profiles and to define nucleic acid expression in normal or
diseased tissues or in infectious agents to identify and validate drug and
diagnostic targets. Analysis of this data is accomplished by the methods, systems
and devices of the present invention. Further, technologies such as 2-hybrid and
2+1 hybrid systems that use proteins to capture the proteins with which they
interact, currently found in yeast and bacterial systems, generate genome-wide
protein interaction maps (PIMs). Large libraries of information such as PIMs can
be manipulated by the present invention.

Antibody chips have been developed that can be used to separate or
identify specific proteins or types of proteins. Additionally, phage antibody
libraries can be used to determine protein function. Genomic libraries can be
searched for open reading frames (ORFs) or ESTs (expressed sequence tags) of
interest and, from the sequence, peptides are synthesized. Peptides for different
genes are placed in 96 well trays for selection of antibodies from phage libraries.
The antibodies are then used to locate the protein relating to the original ORFs or
ESTs in sections of normal and diseased tissue.

The present invention can be used to analyze biological data generated at
multiple stages of investigation into biological functions and, further, to integrate
the different kinds of data for novel diagnostic and prognostic determinations.
For example, biological data obtained from clinical case information, such as
diagnostic test data, family or genetic histories, prior or current medical
treatments, and the clinical outcomes of such activities, can be utilized in the
methods, systems and devices of the present invention. Additionally, clinical
samples such as diseased tissues or fluids, and normal tissues and fluids, and cell
separations can provide biological data that can be utilized by the current
invention. Proteomic determinations such as 2-D gel, mass spectrophotometry
and antibody screening can be used to establish databases that can be utilized by
the present invention. Genomic databases can also be used alone or in
combination with the above-described data and databases by the present
invention to provide comprehensive diagnosis, prognosis or predictive
capabilities to the user of the present invention.

The present invention facilitates analysis of
data by optionally pre-processing the data prior to using the data to train a
learning machine and/or optionally post-processing the output from a learning
machine. Generally stated, pre-processing data comprises reformatting or
augmenting the data in order to allow the learning machine to be applied most
advantageously. In a manner similar to pre-processing, post-processing involves
interpreting the output of a learning machine in order to discover meaningful
characteristics thereof. The meaningful characteristics to be ascertained from the
output may be problem- or data-specific. Post-processing involves interpreting
the output into a form that, for example, may be understood by or is otherwise
useful to a human observer, or converting the output into a form which may be
readily received by another device for, e.g., archival or transmission.

FIG. 1 is a flowchart illustrating a general method 100 for analyzing data
using learning machines. The method 100 begins at starting block 101 and
progresses to step 102 where a specific problem is formalized for application of
analysis through machine learning. Particularly important is a proper formulation
of the desired output of the learning machine. For instance, in predicting future
performance of an individual equity instrument, or a market index, a learning
machine is likely to achieve better performance when predicting the expected
future change rather than predicting the future price level. The future price
expectation can later be derived in a post-processing step as will be discussed
later in this specification.
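
As a minimal sketch of the post-processing idea just described, the following Python fragment derives a price level from a predicted change. The function name, and the assumption that the learning machine outputs a relative (fractional) change, are illustrative only and are not taken from this specification.

```python
def price_from_predicted_change(last_price, predicted_change):
    """Derive an expected price level from a predicted relative change.

    Assumption: `predicted_change` is a fraction of the last known price
    (0.25 means +25%); the specification does not fix this convention.
    """
    return last_price * (1.0 + predicted_change)


# Hypothetical values: last known price 100.0, predicted change of +25%.
expected_price = price_from_predicted_change(100.0, 0.25)
```

The learning machine is trained on the change, and the price level is only reconstructed afterwards, so the derivation step stays outside the training loop.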

After problem formalization, step 103 addresses training data collection.
Training data comprises a set of data points having known characteristics.
Training data may be collected from one or more local and/or remote sources.
The collection of training data may be accomplished manually or by way of an
automated process, such as known electronic data transfer methods. Accordingly,
an exemplary embodiment of the learning machine for use in conjunction with
the present invention may be implemented in a networked computer environment.
Exemplary operating environments for implementing various embodiments of the
learning machine will be described in detail with respect to FIGS. 7-8.

At step 104, the collected training data is optionally pre-processed in
order to allow the learning machine to be applied most advantageously toward
extraction of the knowledge inherent to the training data. During this
preprocessing stage the training data can optionally be expanded through
transformations, combinations or manipulation of individual or multiple
measures within the records of the training data. As used herein, "expanding
data" is meant to refer to altering the dimensionality of the input data by changing
the number of observations available to determine each input point (alternatively,
this could be described as adding or deleting columns within a database table).
By way of illustration, a data point may comprise the coordinates (1,4,9). An
expanded version of this data point may result in the coordinates (1,1,4,2,9,3). In
this example, it may be seen that the coordinates added to the expanded data
point are based on a square-root transformation of the original coordinates. By
adding dimensionality to the data point, this expanded data point provides a
varied representation of the input data that is potentially more meaningful for
analysis by a learning machine. Data expansion in this sense affords
opportunities for learning machines to analyze data not readily apparent in the
unexpanded training data.
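
The square-root expansion in the illustration above can be sketched in a few lines of Python. The function name is hypothetical; the sketch only reproduces the worked example, in which each original coordinate is followed by its square root.

```python
import math


def expand_with_square_roots(point):
    """Return the point with each coordinate followed by its square root.

    Note: math.sqrt raises ValueError for negative coordinates, so this
    particular transformation only suits non-negative measurements.
    """
    expanded = []
    for value in point:
        expanded.extend((value, math.sqrt(value)))
    return tuple(expanded)


# The three-dimensional point from the text becomes six-dimensional.
expanded_point = expand_with_square_roots((1, 4, 9))
```

The original coordinates are kept and the transformed ones are interleaved with them, which is what distinguishes expansion from a plain transformation of the data.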

Expanding data may comprise applying any type of meaningful
transformation to the data and adding those transformations to the original data.
The criteria for determining whether a transformation is meaningful may depend
on the input data itself and/or the type of knowledge that is sought from the data.
Illustrative types of data transformations include: addition of expert information;
labeling; binary conversion, e.g., a bit map; transformations, such as Fourier,
wavelet, Radon, principal component analysis and kernel principal component
analysis, as well as clustering; scaling; probabilistic and statistical
analysis; significance testing; strength testing; searching for two-dimensional
regularities; Hidden Markov Modeling; identification of equivalence relations;
application of contingency tables; application of graph theory principles; creation
of vector maps; addition, subtraction, multiplication, division, application of
polynomial equations and other algebraic transformations; identification of
proportionality; determination of discriminatory power; etc. In the context of
medical data, potentially meaningful transformations include: association with
known standard medical reference ranges; physiologic truncation; physiologic
combinations; biochemical combinations; application of heuristic rules;
diagnostic criteria determinations; clinical weighting systems; diagnostic
transformations; clinical transformations; application of expert knowledge;
labeling techniques; application of other domain knowledge; Bayesian network
knowledge; etc. These and other transformations, as well as combinations
thereof, will occur to those of ordinary skill in the art.
Those skilled in the art should also recognize that data transformations
may be performed without adding dimensionality to the data points. For example
a data point may comprise the coordinate (A, B, C). A transformed version of
this data point may result in the coordinates (1, 2, 3), where the coordinate "1"
has some known relationship with the coordinate "A," the coordinate "2" has
some known relationship with the coordinate "B," and the coordinate "3" has
some known relationship with the coordinate "C." A transformation from letters
to numbers may be required, for example, if letters are not understood by a
learning machine. Other types of transformations are possible without adding
dimensionality to the data points, even with respect to data that is originally in
numeric form. Furthermore, it should be appreciated that pre-processing data to
add meaning thereto may involve analyzing incomplete, corrupted or otherwise
"dirty" data. A learning machine cannot process "dirty" data in a meaningful
manner. Thus a pre-processing step may involve cleaning up or filtering a data
set in order to remove, repair or replace dirty data points.
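
The two pre-processing operations described in this paragraph, a transformation that keeps the dimensionality and a clean-up of "dirty" records, might be sketched as follows. The mapping table, the function names and the use of None to mark a missing value are all assumptions made for illustration.

```python
# Hypothetical mapping; the text only requires that each number have a
# known relationship with the letter it replaces.
LETTER_CODES = {"A": 1, "B": 2, "C": 3}


def encode_letters(point):
    """Replace each letter coordinate by its numeric code (same dimension)."""
    return tuple(LETTER_CODES[symbol] for symbol in point)


def remove_dirty_points(points):
    """Drop data points containing a missing (None) coordinate.

    Removal is only one of the options named in the text; repairing or
    replacing the dirty values would be alternative strategies.
    """
    return [p for p in points if all(value is not None for value in p)]


encoded = encode_letters(("A", "B", "C"))
cleaned = remove_dirty_points([(1, 2, 3), (4, None, 6), (7, 8, 9)])
```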
Returning to FIG. 1, the exemplary method 100 continues at step 106,
where the learning machine is trained using the pre-processed data. As is known
in the art, a learning machine is trained by adjusting its operating parameters until
a desirable training output is achieved. The determination of whether a training
output is desirable may be accomplished either manually or automatically by
comparing the training output to the known characteristics of the training data. A
learning machine is considered to be trained when its training output is within a
predetermined error threshold from the known characteristics of the training data.
In certain situations, it may be desirable, if not necessary, to post-process the
training output of the learning machine at step 107. As mentioned,
post-processing the output of a learning machine involves interpreting the output into a
meaningful form. In the context of a regression problem, for example, it may be
necessary to determine range categorizations for the output of a learning machine
in order to determine if the input data points were correctly categorized. In the
example of a pattern recognition problem, it is often not necessary to post-process
the training output of a learning machine.

At step 108, test data is optionally collected in preparation for testing the
trained learning machine. Test data may be collected from one or more local
and/or remote sources. In practice, test data and training data may be collected
from the same source(s) at the same time. Thus, test data and training data sets
can be divided out of a common data set and stored in a local storage medium for
use as different input data sets for a learning machine. Regardless of how the test
data is collected, any test data used must be pre-processed at step 110 in the same
manner as was the training data. As should be apparent to those skilled in the art,
a proper test of the learning may only be accomplished by using testing data of
the same format as the training data. Then, at step 112 the learning machine is
tested using the pre-processed test data, if any. The test output of the learning
machine is optionally post-processed at step 114 in order to determine if the
results are desirable. Again, the post-processing step involves interpreting the test
output into a meaningful form. The meaningful form may be one that is readily
understood by a human or one that is compatible with another processor.
Regardless, the test output must be post-processed into a form which may be
compared to the test data to determine whether the results were desirable.
Examples of post-processing steps include but are not limited to the following:
optimal categorization determinations, scaling techniques (linear and non-linear),
transformations (linear and non-linear), and probability estimations. The method
100 ends at step 116.
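
The requirement that test data be pre-processed in the same manner as the training data can be illustrated with a small sketch. Per-column scaling to the range 0 to 1 is used here only as an example of a pre-processing step, and the function names are hypothetical. The scaling bounds are derived once from the training data and then reused unchanged for the test data.

```python
def fit_min_max(rows):
    """Return one (minimum, maximum) pair per column of the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def apply_min_max(rows, bounds):
    """Scale each column with previously fitted bounds.

    Values outside the training range map outside [0, 1]; a constant
    training column is mapped to 0.0 to avoid division by zero.
    """
    return [
        [(x - low) / (high - low) if high > low else 0.0
         for x, (low, high) in zip(row, bounds)]
        for row in rows
    ]


training_rows = [[0, 10], [10, 30]]          # hypothetical measurements
test_rows = [[5, 20]]

bounds = fit_min_max(training_rows)           # learned from training data only
scaled_training = apply_min_max(training_rows, bounds)
scaled_test = apply_min_max(test_rows, bounds)
```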

FIG. 2 is a flow chart illustrating an exemplary method 200 for enhancing
knowledge that may be discovered from data using a specific type of learning
machine known as a support vector machine (SVM). A SVM implements a
specialized algorithm for providing generalization when estimating a
multi-dimensional function from a limited collection of data. A SVM may be
particularly useful in solving dependency estimation problems. More
specifically, a SVM may be used accurately in estimating indicator functions (e.g.
pattern recognition problems) and real-valued functions (e.g. function
approximation problems, regression estimation problems, density estimation
problems, and solving inverse problems). The SVM was originally developed by
Vladimir N. Vapnik. The concepts underlying the SVM are explained in detail in
his book, entitled Statistical Learning Theory (John Wiley & Sons, Inc. 1998),
which is herein incorporated by reference in its entirety. Accordingly, a
familiarity with SVMs and the terminology used therewith are presumed
throughout this specification.

The exemplary method 200 begins at starting block 201 and advances to
step 202, where a problem is formulated and then to step 203, where a training
data set is collected. As was described with reference to FIG. 1, training data
may be collected from one or more local and/or remote sources, through a manual
or automated process. At step 204 the training data is optionally pre-processed.
Again, pre-processing comprises enhancing meaning within the training data
by cleaning the data, transforming the data and/or expanding the data. Those
skilled in the art should appreciate that SVMs are capable of processing input
data having extremely large dimensionality. In fact, the larger the dimensionality
of the input data, the better the generalizations a SVM is able to calculate.
Therefore, while training data transformations are possible that do not expand the
training data, in the specific context of SVMs it is preferable that training data be
expanded by adding meaningful information thereto.

At step 206 a kernel is selected for the SVM. As is known in the art,
different kernels will cause a SVM to produce varying degrees of quality in the
output for a given set of input data. Therefore, the selection of an appropriate
kernel may be essential to the desired quality of the output of the SVM. In one
embodiment of the learning machine, a kernel may be chosen based on prior
performance knowledge. As is known in the art, exemplary kernels include
polynomial kernels, radial basis classifier kernels, linear kernels, etc. In an
alternate embodiment, a customized kernel may be created that is specific to a
particular problem or type of data set. In yet another embodiment, the multiple
SVMs may be trained and tested simultaneously, each using a different kernel.
The quality of the outputs for each simultaneously trained and tested SVM may
be compared using a variety of selectable or weighted metrics (see step 222) to
determine the most desirable kernel.
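
For readers less familiar with the kernels named above, the following Python sketch gives common textbook forms of a linear, a polynomial and a radial basis kernel. The exact forms and default parameters (the "+ 1" offset, `degree`, `gamma`) are assumptions; this specification does not prescribe them.

```python
import math


def dot(x, y):
    """Inner product of two equal-length coordinate sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=2):
    # Inhomogeneous polynomial form; the "+ 1" offset is an assumption.
    return (dot(x, y) + 1) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # Gaussian radial basis form; gamma is a hypothetical width parameter.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = [1, 2], [3, 4]
values = (linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, x))
```

Each function maps a pair of data points to a single similarity value, which is what allows one kernel to be swapped for another without changing the rest of the training procedure.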

Next, at step 208 the pre-processed training data is input into the SVM.
At step 210, the SVM is trained using the pre-processed training data to generate
an optimal hyperplane. Optionally, the training output of the SVM may then be
post-processed at step 211. Again, post-processing of training output may be
desirable, or even necessary, at this point in order to properly calculate ranges or
categories for the output. At step 212 test data is collected similarly to previous
descriptions of data collection. The test data is pre-processed at step 214 in the
same manner as was the training data above. Then, at step 216 the pre-processed
test data is input into the SVM for processing in order to determine whether the
SVM was trained in a desirable manner. The test output is received from the
SVM at step 218 and is optionally post-processed at step 220.

Based on the post-processed test output, it is determined at step 222
whether an optimal minimum was achieved by the SVM. Those skilled in the art
should appreciate that a SVM is operable to ascertain an output having a global
minimum error. However, as mentioned above, output results of a SVM for a
given data set will typically vary with kernel selection. Therefore, there are in
fact multiple global minimums that may be ascertained by a SVM for a given set
of data. As used herein, the term "optimal minimum" or "optimal solution"
refers to a selected global minimum that is considered to be optimal (e.g. the
optimal solution for a given set of problem specific, pre-established criteria)
when compared to other global minimums ascertained by a SVM. Accordingly,
at step 222, determining whether the optimal minimum has been ascertained may
involve comparing the output of a SVM with a historical or predetermined value.
Such a predetermined value may be dependent on the test data set. For example,
in the context of a pattern recognition problem where data points are classified by
a SVM as either having a certain characteristic or not having the characteristic, a
global minimum error of 50% would not be optimal. In this example, a global
minimum of 50% is no better than the result that would be achieved by flipping a
coin to determine whether the data point had that characteristic. As another
example, in the case where multiple SVMs are trained and tested simultaneously
with varying kernels, the outputs for each SVM may be compared with output of
other SVM to determine the practical optimal solution for that particular set of
kernels. The determination of whether an optimal solution has been ascertained
may be performed manually or through an automated comparison process.

If it is determined that the optimal minimum has not been achieved by the
trained SVM, the method advances to step 224, where the kernel selection is
adjusted. Adjustment of the kernel selection may comprise selecting one or more
new kernels or adjusting kernel parameters. Furthermore, in the case where
multiple SVMs were trained and tested simultaneously, selected kernels may be
replaced or modified, while other kernels may be re-used for control purposes.
After the kernel selection is adjusted, the method 200 is repeated from step 208,
where the pre-processed training data is input into the SVM for training purposes.
When it is determined at step 222 that the optimal minimum has been achieved,
the method advances to step 226, where live data is collected similarly as
described above. By definition, live data has not been previously evaluated, so
that the desired output characteristics that were known with respect to the training
data and the test data are not known.
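
The loop formed by steps 208 through 224 can be outlined as below. This is only a control-flow sketch under stated assumptions: `evaluate` is a hypothetical callable that trains and tests an SVM with a given kernel and returns its test error, and `acceptable_error` stands in for the pre-established criterion applied at step 222. No real SVM library is invoked.

```python
def search_kernels(rounds, evaluate, acceptable_error):
    """Try successive sets of candidate kernels until one is good enough.

    rounds: iterable of dicts mapping a kernel name to a kernel object;
        each dict represents one adjustment of the kernel selection.
    evaluate: hypothetical callable returning the test error for a kernel
        (lower is better).
    Returns (best_name, best_error), or None if nothing was evaluated.
    """
    best = None
    for candidates in rounds:
        for name, kernel in candidates.items():
            error = evaluate(kernel)          # train, test and post-process
            if best is None or error < best[1]:
                best = (name, error)
        if best is not None and best[1] <= acceptable_error:
            break                             # criterion met; stop adjusting
    return best
```

With a stub in place of real training, the search stops as soon as a kernel meets the criterion and does not evaluate the remaining rounds:

```python
errors = {"k_linear": 0.5, "k_poly": 0.2, "k_rbf": 0.1}   # made-up errors
rounds = [{"linear": "k_linear"}, {"polynomial": "k_poly"}, {"rbf": "k_rbf"}]
result = search_kernels(rounds, errors.get, 0.25)
```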

At step 228 the live data is pre-processed in the same manner as was the
training data and the test data. At step 230, the live pre-processed data is input
into the SVM for processing. The live output of the SVM is received at step 232
and is post-processed at step 234. In one embodiment of the learning machine,
post-processing comprises converting the output of the SVM into a
computationally-derived alpha-numerical classifier for interpretation by a human
or computer. Preferably, the alphanumerical classifier comprises a single value
that is easily comprehended by the human or computer. The method 200 ends at
step 236.

FIG. 3 is a flow chart illustrating an exemplary optimal categorization
method 300 that may be used for pre-processing data or post-processing output
from a learning machine. Additionally, as will be described below, the
exemplary optimal categorization method may be used as a stand-alone
categorization technique, independent from learning machines. The exemplary
optimal categorization method 300 begins at starting block 301 and progresses to
step 302, where an input data set is received. The input data set comprises a
sequence of data samples from a continuous variable. The data samples fall
within two or more classification categories. Next, at step 304 the bin and
class-tracking variables are initialized. As is known in the art, bin variables relate to
resolution, while class-tracking variables relate to the number of classifications
within the data set. Determining the values for initialization of the bin and
class-tracking variables may be performed manually or through an automated process,
such as a computer program for analyzing the input data set. At step 306, the
data entropy for each bin is calculated. Entropy is a mathematical quantity that
measures the uncertainty of a random distribution. In the exemplary method 300,
entropy is used to gauge the gradations of the input variable so that maximum
classification capability is achieved.

The method 300 produces a series of "cuts" on the continuous variable,
such that the continuous variable may be divided into discrete categories. The
cuts selected by the exemplary method 300 are optimal in the sense that the
average entropy of each resulting discrete category is minimized. At step 308, a
determination is made as to whether all cuts have been placed within the input data
set comprising the continuous variable. If all cuts have not been placed,
sequential bin combinations are tested for cutoff determination at step 310. From
step 310, the exemplary method 300 loops back through step 306 and returns to
step 308 where it is again determined whether all cuts have been placed within the
input data set comprising the continuous variable. When all cuts have been
placed, the entropy for the entire system is evaluated at step 309 and compared to
previous results from testing more or fewer cuts. If it cannot be concluded that a
minimum entropy state has been determined, then other possible cut selections
must be evaluated and the method proceeds to step 311. From step 311 a
heretofore untested selection for number of cuts is chosen and the above process
is repeated from step 304. When either the limits of the resolution determined by
the bin width have been tested or the convergence to a minimum solution has been
identified, the optimal classification criteria are output at step 312 and the
exemplary optimal categorization method 300 ends at step 314.

The optimal categorization method 300 takes advantage of dynamic
programming techniques. As is known in the art, dynamic programming
techniques may be used to significantly improve the efficiency of solving certain
complex problems through carefully structuring an algorithm to reduce redundant
calculations. In the optimal categorization problem, the straightforward approach
of exhaustively searching through all possible cuts in the continuous variable data
would result in an algorithm of exponential complexity and would render the
problem intractable for even moderate sized inputs. By taking advantage of the
additive property of the target function, in this problem the average entropy, the
problem may be divided into a series of sub-problems. By properly formulating
algorithmic sub-structures for solving each sub-problem and storing the solutions
of the sub-problems, a significant amount of redundant computation may be
identified and avoided. As a result of using the dynamic programming approach,
the exemplary optimal categorization method 300 may be implemented as an
algorithm having a polynomial complexity, which may be used to solve large
sized problems.
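
To make the dynamic programming idea concrete, the following Python sketch places a fixed number of cuts on a continuous variable so that the size-weighted average entropy of the resulting categories is minimal. It is not the flow of method 300 itself: the number of cuts is supplied by the caller rather than searched, one bin per distinct value is assumed, and all names are hypothetical. It relies on the additive property noted above, since the total cost is a sum of per-category costs.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def optimal_cuts(samples, n_cuts):
    """Return (cut_positions, average_entropy) for (value, label) samples."""
    # One bin per distinct value, so that equal values are never separated.
    bins = {}
    for value, label in samples:
        bins.setdefault(value, Counter())[label] += 1
    values = sorted(bins)
    labels = sorted({label for counter in bins.values() for label in counter})
    n_bins, n_groups = len(values), n_cuts + 1
    if not 1 <= n_groups <= n_bins:
        raise ValueError("n_cuts must be between 0 and (distinct values - 1)")

    # prefix[j][k] = number of samples of class k in the first j bins.
    prefix = [[0] * len(labels)]
    for value in values:
        prefix.append([p + bins[value][label]
                       for p, label in zip(prefix[-1], labels)])

    def cost(i, j):
        """Sample count times entropy for the category made of bins i..j-1."""
        counts = [b - a for a, b in zip(prefix[i], prefix[j])]
        return sum(counts) * entropy(counts)

    # best[g][j] = lowest total cost of splitting the first j bins into g groups.
    best = [[math.inf] * (n_bins + 1) for _ in range(n_groups + 1)]
    back = [[0] * (n_bins + 1) for _ in range(n_groups + 1)]
    best[0][0] = 0.0
    for g in range(1, n_groups + 1):
        for j in range(g, n_bins + 1):
            for i in range(g - 1, j):
                candidate = best[g - 1][i] + cost(i, j)
                if candidate < best[g][j]:
                    best[g][j], back[g][j] = candidate, i

    # Walk the back-pointers to recover the cut positions (bin midpoints).
    cuts, j = [], n_bins
    for g in range(n_groups, 1, -1):
        i = back[g][j]
        cuts.append((values[i - 1] + values[i]) / 2)
        j = i
    cuts.reverse()
    return cuts, best[n_groups][n_bins] / sum(prefix[n_bins])
```

Because each sub-problem (the best split of the first j bins into g groups) is solved once and stored, the running time grows polynomially with the number of bins rather than exponentially. For example, with made-up data in which values 0 to 3 belong to one class and values 4 to 7 to another, `optimal_cuts(data, 1)` places its single cut midway between 3 and 4.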

As mentioned, the exemplary optimal categorization method 300
may be used in pre-processing data and/or post-processing the output of a
learning machine. For example, as a pre-processing transformation step, the
exemplary optimal categorization method 300 may be used to extract
classification information from raw data. As a post-processing technique, the
exemplary optimal range categorization method may be used to determine the
optimal cut-off values for markers objectively based on data, rather than relying
on ad hoc approaches. As should be apparent, the exemplary optimal
categorization method 300 has applications in pattern recognition, classification,
regression problems, etc. The exemplary optimal categorization method 300 may
also be used as a stand-alone categorization technique, independent from SVMs
and other learning machines.

FIG. 4 illustrates an exemplary unexpanded data set 400 that may be used
as input for a support vector machine. This data set 400 is referred to as
"unexpanded" because no additional information has been added thereto. As
shown, the unexpanded data set comprises a training data set 402 and a test data
set 404. Both the unexpanded training data set 402 and the unexpanded test data
set 404 comprise data points, such as exemplary data point 406, relating to
historical clinical data from sampled medical patients. In this example, the data
set 400 may be used to train a SVM to determine whether a breast cancer patient
will experience a recurrence or not.



Each data point includes five input coordinates, or dimensions, and an
output classification, shown as 406a-f, which represent medical data collected for
each patient. In particular, the first coordinate 406a represents "Age," the second
coordinate 406b represents "Estrogen Receptor Level," the third coordinate 406c
represents "Progesterone Receptor Level," the fourth coordinate 406d represents
"Total Lymph Nodes Extracted," the fifth coordinate 406e represents "Positive
(Cancerous) Lymph Nodes Extracted," and the output classification 406f
represents the "Recurrence Classification." The important known characteristic
of the data 400 is the output classification 406f (Recurrence Classification),
which, in this example, indicates whether the sampled medical patient responded
to treatment favorably without recurrence of cancer ("-1") or responded to
treatment negatively with recurrence of cancer ("1"). This known characteristic
will be used for learning while processing the training data in the SVM, will be
used in an evaluative fashion after the test data is input into the SVM, thus
creating a "blind" test, and will obviously be unknown in the live data of current
medical patients.

Table 1 provides an exemplary test output from a SVM trained with the
unexpanded training data set 402 and tested with the unexpanded data set 404
shown in FIG. 4.

Vapnik's Polynomial
Alphas bounded up to 1000
Input values will be individually scaled to lie between 0 and 1
SV zero threshold: 1e-16
Margin threshold: 0.1
Objective zero tolerance: 1e-17
Degree of polynomial: 2

Test set:
Total samples: 24
Positive samples: 8
False negatives: 4
Negative samples: 16
False positives: 6

The test output has been post-processed to be comprehensible by a human
or computer. According to the table, the test output shows that 24 total samples
(data points) were examined by the SVM and that the SVM incorrectly identified
four of eight positive samples (50%), i.e., found negative for a positive sample,
and incorrectly identified 6 of sixteen negative samples (37.5%), i.e., found
positive for a negative sample.
FIG. 5 illustrates an exemplary expanded data set 600 that may be used as
input for a support vector machine. This data set 600 is referred to as "expanded"
because additional information has been added thereto. Note that aside from the
added information, the expanded data set 600 is identical to the unexpanded data
set 400 shown in FIG. 4. The additional information supplied to the expanded
data set has been supplied using the exemplary optimal range categorization
method 300 described with reference to FIG. 3. As shown, the expanded data set
comprises a training data set 602 and a test data set 604. Both the expanded
training data set 602 and the expanded test data set 604 comprise data points,
such as exemplary data point 606, relating to historical data from sampled
medical patients. Again, the data set 600 may be used to train a SVM to learn
whether a breast cancer patient will experience a recurrence of the disease.

Through application of the exemplary optimal categorization method 300,
each expanded data point includes twenty coordinates (or dimensions) 606a1-3
through 606e1-3, and an output classification 606f, which collectively represent
medical data and categorization transformations thereof for each patient. In
particular, the first coordinate 606a represents "Age," and the second coordinate
through the fourth coordinate 606a1 - 606a3 are variables that combine to
represent a category of age. For example, a range of ages may be categorized
into "young," "middle-aged" and "old" categories respective to the range
of ages present in the data. As shown, a string of variables "0" (606a1), "0"
(606a2), "1" (606a3) may be used to indicate that a certain age value is
categorized as "old." Similarly, a string of variables "0" (606a1), "1" (606a2),
"0" (606a3) may be used to indicate that a certain age value is categorized as
"middle-aged." Also, a string of variables "1" (606a1), "0" (606a2), "0" (606a3)
may be used to indicate that a certain age value is categorized as "young." From
an inspection of FIG. 6, it may be seen that the optimal categorization of the
range of "Age" 606a values, using the exemplary method 300, was determined to
be 31-33 = "young," 34 = "middle-aged" and 35-49 = "old." The other
coordinates, namely coordinate 606b "Estrogen Receptors Level," coordinate
606c "Progesterone Receptor Level," coordinate 606d "Total Lymph Nodes
Extracted," and coordinate 606e "Positive (Cancerous) Lymph Nodes Extracted,"
have each been optimally categorized in a similar manner.
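
The expansion of the "Age" coordinate described above can be sketched as follows. This is a minimal illustration only: the age ranges are the ones reported for FIG. 6, while the names `AGE_CATEGORIES` and `expand_age` are hypothetical and do not come from the specification.

```python
# Age ranges as reported in the text for the exemplary data set;
# the structure and names below are illustrative assumptions.
AGE_CATEGORIES = [
    ("young", 31, 33),
    ("middle-aged", 34, 34),
    ("old", 35, 49),
]


def expand_age(age):
    """Return [age, a1, a2, a3], where exactly one indicator variable is 1."""
    indicators = [1 if low <= age <= high else 0
                  for _, low, high in AGE_CATEGORIES]
    if sum(indicators) != 1:
        raise ValueError(f"age {age} is outside the categorized range 31-49")
    return [age] + indicators
```

For instance, `expand_age(40)` keeps the raw age and appends the indicator string 0, 0, 1 for the "old" category, so one original coordinate becomes four.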
Table 2 provides an exemplary expanded test output from a SVM trained
with the expanded training data set 602 and tested with the expanded data set 604
shown in FIG. 6.



Vapnik's Polynomial
Alphas bounded up to 1000
Input values will be individually scaled to lie between 0 and 1
SV zero threshold: 1e-16
Margin threshold: 0.1
Objective zero tolerance: 1e-17
Degree of polynomial: 2

Test set:
Total samples:     24
Positive samples:   8
False negatives:    4
Negative samples:  16
False positives:    4

Table 2



The expanded test output has been post-processed to be comprehensible by a
human or computer. As indicated, the expanded test output shows that 24 total
samples (data points) were examined by the SVM and that the SVM incorrectly
identified four of eight positive samples (50%) and incorrectly identified four of
sixteen negative samples (25%). Accordingly, by comparing this expanded test
output with the unexpanded test output of Table 1, it may be seen that the
expansion of the data points leads to improved results (i.e., a lower global
minimum error), specifically a reduced instance of patients who would
unnecessarily be subjected to follow-up cancer treatments.
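
The percentages quoted above follow directly from the reported counts (24 samples, four of eight positives missed, four of sixteen negatives misidentified). A minimal sketch of that arithmetic is given below; the function name `error_rates` is an illustrative assumption.

```python
def error_rates(positives, false_negatives, negatives, false_positives):
    """Return (false-negative rate, false-positive rate, overall error rate)."""
    fn_rate = false_negatives / positives
    fp_rate = false_positives / negatives
    overall = (false_negatives + false_positives) / (positives + negatives)
    return fn_rate, fp_rate, overall


# Counts reported for the expanded test output.
expanded = error_rates(positives=8, false_negatives=4,
                       negatives=16, false_positives=4)
```

Here `expanded` holds a 50% false-negative rate and a 25% false-positive rate, matching the figures in the text.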

FIG. 6 illustrates an exemplary input and output for a stand alone
application of the optimal categorization method 300 described in FIG. 3. In the
example of FIG. 6, the input data set 801 comprises a "Number of Positive
Lymph Nodes" 802 and a corresponding "Recurrence Classification" 804. In this
example, the optimal categorization method 300 has been applied to the input
data set 801 in order to locate the optimal cutoff point for determination of
treatment for cancer recurrence, based solely upon the number of positive lymph
nodes collected in a post-surgical tissue sample. The well-known clinical
standard is to prescribe treatment for any patient with at least three positive
nodes. However, the optimal categorization method 300 demonstrates that the
optimal cutoff, seen in Table 3, based upon the input data 801, should be at the
higher value of 5.5 lymph nodes, which corresponds to a clinical rule prescribing
follow-up treatments in patients with at least six positive lymph nodes.

As shown in Table 4 below, the prior art accepted clinical cutoff point (≥
3.0) resulted in 47% correctly classified recurrences and 71% correctly classified
non-recurrences.

Cut Point           Correctly Classified    Correctly Classified
                    Recurrence              Non-Recurrence
Clinical (≥ 3.0)    47%                     71%
Optimal (≥ 5.5)     33%                     97%

Table 4

Accordingly, 53% of the recurrences were incorrectly classified (further treatment
was improperly not recommended) and 29% of the non-recurrences were
incorrectly classified (further treatment was incorrectly recommended). By
contrast, the cutoff point determined by the optimal categorization method 300 (≥
5.5) resulted in 33% correctly classified recurrences and 97% correctly classified
non-recurrences. Accordingly, 67% of the recurrences were incorrectly classified
(further treatment was improperly not recommended) and 3% of the
non-recurrences were incorrectly classified (further treatment was incorrectly
recommended).
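
To make the comparison of cutoff points concrete, the sketch below scores a rule of the form "recommend treatment when the node count is at least the cutoff." It is an illustration under stated assumptions: the function name `score_cutoff` is hypothetical, and the eight toy patients are invented for the example and are not the patent's data.

```python
def score_cutoff(node_counts, recurred, cutoff):
    """Percent of recurrences and of non-recurrences correctly classified
    by the rule 'treat when node count >= cutoff'."""
    pairs = list(zip(node_counts, recurred))
    hits_rec = sum(1 for n, r in pairs if r and n >= cutoff)
    hits_non = sum(1 for n, r in pairs if not r and n < cutoff)
    n_rec = sum(1 for r in recurred if r)
    n_non = len(recurred) - n_rec
    return 100.0 * hits_rec / n_rec, 100.0 * hits_non / n_non


# Hypothetical toy data (NOT from the specification).
nodes = [0, 1, 2, 3, 4, 6, 7, 9]
recurred = [False, False, True, False, False, True, False, True]

clinical = score_cutoff(nodes, recurred, 3.0)
optimal = score_cutoff(nodes, recurred, 5.5)
```

On this toy data, raising the cutoff from 3.0 to 5.5 leaves the share of correctly classified recurrences unchanged and doubles the share of correctly classified non-recurrences, which is the kind of trade-off the text weighs.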

As shown by this example, it may be feasible to attain a higher instance of
correctly identifying those patients who can avoid the post-surgical cancer
treatment regimes, using the exemplary optimal categorization method 300. Even
though the cutoff point determined by the optimal categorization method 300
yielded a moderately higher percentage of incorrectly classified recurrences, it
yielded a significantly lower percentage of incorrectly classified non-recurrences.
Thus, considering the trade-off, and realizing that the goal of the optimization
problem was the avoidance of unnecessary treatment, the results of the cutoff
point determined by the optimal categorization method 300 are mathematically
superior to those of the prior art clinical cutoff point. This type of information is
potentially extremely useful in providing additional insight to patients weighing
the choice between undergoing treatments such as chemotherapy or risking a
recurrence of breast cancer.

Table 5 is a comparison of exemplary post-processed output from a first
support vector machine comprising a linear kernel and a second support vector
machine comprising a polynomial kernel.

I. Simple Dot Product                  II. Vapnik's Polynomial

Alphas bounded up to 1000              Alphas bounded up to 1000
Input values will not be scaled.       Input values will not be scaled.
SV zero threshold: 1e-16               SV zero threshold: 1e-16
Margin threshold: 0.1                  Margin threshold: 0.1
Objective zero tolerance: 1e-07        Objective zero tolerance: 1e-07

Test set:                              Test set:
Total samples:     24                  Total samples:     24
Positive samples:   8                  Positive samples:   8
False negatives:    6                  False negatives:    2
Negative samples:  16                  Negative samples:  16
False positives:    3                  False positives:    4

Table 5

Table 5 demonstrates that a variation in the selection of a kernel may affect the
level of quality of the output of a SVM. As shown, the post-processed output of a
first SVM (Column I) comprising a linear dot product kernel indicates that for a
given test set of twenty four samples, six of eight positive samples were
incorrectly identified and three of sixteen negative samples were incorrectly
identified. By way of comparison, the post-processed output for a second SVM
(Column II) comprising a polynomial kernel indicates that for the same test set,
only two of eight positive samples were incorrectly identified and four of sixteen
negative samples were incorrectly identified. By way of comparison, the
polynomial kernel yielded significantly improved results pertaining to the
identification of positive samples and yielded only slightly worse results
pertaining to the identification of negative samples. Thus, as will be apparent to
those of skill in the art, the global minimum error for the polynomial kernel is
lower than the global minimum error for the linear kernel for this data set.
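
The two kernels compared above can be sketched as plain functions. The exact form of the polynomial kernel is not spelled out in this section, so the common convention K(x, y) = (x · y + 1)^d is assumed here; the function names are illustrative.

```python
def linear_kernel(x, y):
    """Simple dot product kernel: K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2):
    """Polynomial kernel, ASSUMED form: K(x, y) = (x . y + 1) ** degree."""
    return (linear_kernel(x, y) + 1.0) ** degree
```

With degree 2 the polynomial kernel implicitly includes pairwise products of input features, which is one plausible reason it can separate data that a plain dot product cannot.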
FIG. 7 and the following discussion are intended to provide a brief and
general description of a suitable computing environment for implementing
biological data analysis according to the present invention. Although the system
shown in FIG. 7 is a conventional personal computer 1000, those skilled in the art
will recognize that the invention also may be implemented using other types of
computer system configurations. The computer 1000 includes a central
processing unit 1022, a system memory 1020, and an Input/Output ("I/O") bus
1026. A system bus 1021 couples the central processing unit 1022 to the system
memory 1020. A bus controller 1023 controls the flow of data on the I/O bus
1026 and between the central processing unit 1022 and a variety of internal and
external I/O devices. The I/O devices connected to the I/O bus 1026 may have
direct access to the system memory 1020 using a Direct Memory Access
("DMA") controller 1024.
The I/O devices are connected to the I/O bus 1026 via a set of device
interfaces. The device interfaces may include both hardware components and
software components. For instance, a hard disk drive 1030 and a floppy disk
drive 1032 for reading or writing removable media 1050 may be connected to the
I/O bus 1026 through disk drive controllers 1040. An optical disk drive 1034 for
reading or writing optical media 1052 may be connected to the I/O bus 1026
using a Small Computer System Interface ("SCSI") 1041. Alternatively, an IDE
(Integrated Drive Electronics, i.e., a hard disk drive interface for PCs), ATAPI
(ATtAchment Packet Interface, i.e., CD-ROM and tape drive interface), or EIDE
(Enhanced IDE) interface may be associated with an optical drive such as may be
the case with a CD-ROM drive. The drives and their associated computer-
readable media provide nonvolatile storage for the computer 1000. In addition to
the computer-readable media described above, other types of computer-readable
media may also be used, such as ZIP drives, or the like.

A display device 1053, such as a monitor, is connected to the I/O bus
1026 via another interface, such as a video adapter 1042. A parallel interface
1043 connects synchronous peripheral devices, such as a laser printer 1056, to the
I/O bus 1026. A serial interface 1044 connects communication devices to the I/O
bus 1026. A user may enter commands and information into the computer 1000
via the serial interface 1044 or by using an input device, such as a keyboard 1038,
a mouse 1036 or a modem 1057. Other peripheral devices (not shown) may also
be connected to the computer 1000, such as audio input/output devices or image
capture devices.

A number of program modules may be stored on the drives and in the
system memory 1020. The system memory 1020 can include both Random
Access Memory ("RAM") and Read Only Memory ("ROM"). The program
modules control how the computer 1000 functions and interacts with the user,
with I/O devices or with other computers. Program modules include routines,
operating systems 1065, application programs, data structures, and other software
or firmware components. In an illustrative embodiment, the learning machine
may comprise one or more pre-processing program modules 1075A, one or more
post-processing program modules 1075B, and/or one or more optimal
categorization program modules 1077 and one or more SVM program modules
1070 stored on the drives or in the system memory 1020 of the computer 1000.
Specifically, pre-processing program modules 1075A, post-processing program
modules 1075B, together with the SVM program modules 1070 may comprise
computer-executable instructions for pre-processing data and post-processing
output from a learning machine and implementing the learning algorithm
according to the exemplary methods described with reference to FIGS. 1 and 2.

Furthermore, optimal categorization program modules 1077 may comprise
computer-executable instructions for optimally categorizing a data set according
to the exemplary methods described with reference to FIG. 3.

The computer 1000 may operate in a networked environment using
logical connections to one or more remote computers, such as remote computer
1060. The remote computer 1060 may be a server, a router, a peer device or
other common network node, and typically includes many or all of the elements
described in connection with the computer 1000. In a networked environment,
program modules and data may be stored on the remote computer 1060. The
logical connections depicted in FIG. 8 include a local area network ("LAN") 1054
and a wide area network ("WAN") 1055. In a LAN environment, a network
interface 1043, such as an Ethernet adapter card, can be used to connect the
computer 1000 to the remote computer 1060. In a WAN environment, the
computer 1000 may use a telecommunications device, such as a modem 1057, to
establish a connection. It will be appreciated that the network connections shown
are illustrative and other devices of establishing a communications link between
the computers may be used.

In another embodiment, a plurality of SVMs can be configured to
hierarchically process multiple data sets in parallel or sequentially. In particular,
one or more first-level SVMs may be trained and tested to process a first type of
data and one or more first-level SVMs can be trained and tested to process a
second type of data. Additional types of data may be processed by other
first-level SVMs. The output from some or all of the first-level SVMs may be
combined in a logical manner to produce an input data set for one or more
second-level SVMs. In a similar fashion, output from a plurality of second-level
SVMs may be combined in a logical manner to produce input data for one or
more third-level SVMs. The hierarchy of SVMs may be expanded to any number
of levels as may be appropriate. In this manner, lower hierarchical level SVMs
may be used to pre-process data that is to be input into higher level SVMs. Also,
higher hierarchical level SVMs may be used to post-process data that is output
from lower hierarchical level SVMs.

Each SVM in the hierarchy or each hierarchical level of SVMs may be
configured with a distinct kernel. For example, SVMs used to process a first type
of data may be configured with a first type of kernel while SVMs used to process
a second type of data may utilize a second, different type of kernel. In addition,
multiple SVMs in the same or different hierarchical level may be configured to
process the same type of data using distinct kernels.

FIG. 8 illustrates an exemplary hierarchical system of SVMs. As shown,
one or more first-level SVMs 1302a and 1302b may be trained and tested to
process a first type of input data 1304a, such as mammography data, pertaining to
a sample of medical patients. One or more of these SVMs may comprise a
distinct kernel, indicated as "KERNEL 1" and "KERNEL 2". Also, one or more
additional first-level SVMs 1302c and 1302d may be trained and tested to process
a second type of data 1304b, which may be, for example, genomic data for the
same or a different sample of medical patients. Again, one or more of the
additional SVMs may comprise a distinct kernel, indicated as "KERNEL 1" and
"KERNEL 3". The output from each of the like first-level SVMs may be
compared with each other, e.g., 1306a compared with 1306b; 1306c compared
with 1306d, in order to determine optimal outputs 1308a and 1308b. Then, the
optimal outputs from the two groups of first-level SVMs, i.e., outputs 1308a and
1308b, may be combined to form a new multi-dimensional input data set 1310,
for example, relating to mammography and genomic data. The new data set may
then be processed by one or more appropriately trained and tested second-level
SVMs 1312a and 1312b. The resulting outputs 1314a and 1314b from second-
level SVMs 1312a and 1312b may be compared to determine an optimal output
1316. Optimal output 1316 may identify causal relationships between the
mammography and genomic data points. As should be apparent to those of skill
in the art, other combinations of hierarchical SVMs may be used to process either
in parallel or serially, data of different types in any field or industry in which
analysis of data is desired.
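
The flow of data through such a hierarchy can be sketched as follows. This is only a structural illustration under stated assumptions: the text does not say how outputs are "compared" to find the optimal one, so the sketch assumes the candidate with the fewest disagreements against known labels is chosen; the first-level outputs are hard-coded stand-ins rather than real SVM outputs; and the names `pick_optimal` and `combine` are hypothetical.

```python
def pick_optimal(candidate_outputs, labels):
    """Return the candidate output list that disagrees least with the labels
    (ASSUMED comparison rule; the text does not specify one)."""
    def errors(output):
        return sum(1 for o, y in zip(output, labels) if o != y)
    return min(candidate_outputs, key=errors)


def combine(*optimal_outputs):
    """Merge per-sample outputs into multi-dimensional input rows
    for the next level of the hierarchy."""
    return [list(row) for row in zip(*optimal_outputs)]


labels = [1, -1, 1, -1]

# Stand-in outputs of two first-level classifiers per data type.
mammography_outputs = [[1, -1, 1, 1], [1, 1, -1, 1]]
genomic_outputs = [[-1, -1, 1, -1], [1, -1, 1, -1]]

second_level_input = combine(
    pick_optimal(mammography_outputs, labels),
    pick_optimal(genomic_outputs, labels),
)
```

Each row of `second_level_input` has one value per data type, which is the sense in which the combined data set is multi-dimensional.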

The problem of selection of a small amount of data from a large data
source, such as a gene subset from a microarray, is particularly solved using the
methods, devices and systems described herein. Previous attempts to address this
problem used correlation techniques, i.e., assigning a coefficient to the strength of
association between variables. Preferred methods described herein use support
vector machine methods based on recursive feature elimination (RFE). In
examining genetic data to find determinative genes, these methods eliminate gene
redundancy automatically and yield better and more compact gene subsets. The
methods, devices and systems described herein can be used with publicly
available data to find relevant answers, such as genes determinative of a cancer
diagnosis, or with specifically generated data.

There are many different methods for analyzing large data sources. The
following examples are directed to gene expression data manipulations, though
any data can be used in the methods, systems and devices described herein.
There are studies of gene clusters discovered by unsupervised or supervised
learning techniques. Preferred methods comprise application of state of the art
classification algorithms, SVMs, in determining a small subset of highly
discriminant genes that can be used to build very reliable cancer classifiers.
Identification of discriminant genes is beneficial in confirming recent discoveries
in research or in suggesting avenues for research or treatment. Diagnostic tests
that measure the abundance of a given protein in bodily fluids may be derived
from the discovery of a small subset of discriminant genes.

The examples provided below demonstrate the effectiveness of SVMs in
discovering informative features or attributes. Use of SVMs and the methods
herein is qualitatively and quantitatively advantageous when compared with
other methods.

In classification methods using SVMs, the input is a vector referred to as a
"pattern" of n components referred to as "features". F is defined as the n-
dimensional feature space. In the examples given, the features are gene
expression coefficients and the patterns correspond to patients. While the present
discussion is directed to two-class classification problems, this is not to limit the
scope of the invention. The two classes are identified with the symbols (+) and
(-). A training set of a number of patterns $\{x_1, x_2, \ldots, x_k, \ldots, x_\ell\}$ with
known class labels $\{y_1, y_2, \ldots, y_k, \ldots, y_\ell\}$, $y_k \in \{-1, +1\}$, is given. The
training patterns are used to build a decision function (or discriminant function)
D(x), that is a scalar function of an input pattern x. New patterns are classified
according to the sign of the decision function:

$$ D(\mathbf{x}) > 0 \Rightarrow \mathbf{x} \in \text{class } (+); $$
$$ D(\mathbf{x}) < 0 \Rightarrow \mathbf{x} \in \text{class } (-); $$
$$ D(\mathbf{x}) = 0, \text{ decision boundary}; $$

where $\in$ means "is a member of."

Decision boundaries that are simple weighted sums of the training patterns plus a
bias are referred to as "linear discriminant functions". Herein,

$$ D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \qquad (1) $$

where w is the weight vector and b is a bias value. A data set is said to be
linearly separable if a linear discriminant function can separate it without error.
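
The sign rule for a linear discriminant function with weight vector w and bias b can be sketched in a few lines. The function names and the example weights are illustrative assumptions, not values from the specification.

```python
def decision_function(w, b, x):
    """Linear discriminant: D(x) = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class (+) or (-) by the sign of D(x); zero lies on the boundary."""
    d = decision_function(w, b, x)
    if d > 0:
        return "+"
    if d < 0:
        return "-"
    return "boundary"
```

For example, with the hypothetical values `w = [1.0, -2.0]` and `b = 0.5`, the pattern `[2.0, 1.0]` gives D(x) = 0.5 and is assigned to class (+).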

A common problem in classification, and machine learning in general, is
the reduction of dimensionality of feature space to overcome the risk of
"overfitting." Data overfitting arises when the number n of features is large, such
as the thousands of genes studied in a microarray, and the number of training
patterns is comparatively small, such as a few dozen patients. In such situations,
one can find a decision function that separates the training data, even a linear
decision function, but it will perform poorly on test data. Training techniques
that use regularization, i.e., restricting the class of admissible solutions, can avoid
overfitting the data without requiring space dimensionality reduction. Support
Vector Machines (SVMs) use regularization, however even SVMs can benefit
from space dimensionality reduction.

Another method of feature reduction is projecting on the first few
principal directions of the data. Using this method, new features are obtained that
are linear combinations of the original features. One disadvantage of projection
methods is that none of the original input features can be discarded. Preferred
methods incorporate pruning techniques that eliminate some of the original input
features while retaining a minimum subset of features that yield better
classification performance. For design of diagnostic tests, it is of practical
importance to be able to select a small subset of genes for cost effectiveness and
to permit the relevance of the genes selected to be verified more easily.

The problem of feature selection is well known in pattern recognition.
Given a particular classification technique, one can select the best subset of
features satisfying a given "model selection" criterion by exhaustive enumeration
of all subsets of features. However, this method is impractical for large numbers
of features, such as thousands of genes, because of the combinatorial explosion of
the number of subsets.

Given the foregoing difficulties, feature selection in large dimensional
input spaces is performed using greedy algorithms. Among various possible
methods, feature ranking techniques are particularly preferred. A fixed number
of top ranked features may be selected for further analysis or to design a
classifier. Alternatively, a threshold can be set on the ranking criterion. Only the
features whose criterion exceeds the threshold are retained. A preferred method
uses the ranking to define nested subsets of features, $F_1 \subset F_2 \subset \cdots \subset F$, and
select an optimum subset of features with a model selection criterion by varying a
single parameter: the number of features.
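
The two ways of using a ranking described above, thresholding the criterion and building nested subsets indexed by the number of features, can be sketched as follows. The function names and the gene labels in the usage note are hypothetical.

```python
def select_by_threshold(criteria, threshold):
    """Indices of features whose ranking criterion exceeds the threshold."""
    return [i for i, c in enumerate(criteria) if c > threshold]


def nested_subsets(ranked_features):
    """Given features ordered best first, return nested subsets
    F_1, F_2, ..., F, where F_k holds the k top-ranked features."""
    return [ranked_features[:k] for k in range(1, len(ranked_features) + 1)]
```

For a ranked list such as `["g3", "g1", "g2"]`, `nested_subsets` yields `["g3"]`, `["g3", "g1"]` and the full list, so choosing a subset reduces to choosing a single integer.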

Errorless separation can be achieved with any number of genes greater
than one. Preferred methods comprise use of a smaller number of genes.
Classical gene selection methods select the genes that individually best classify
the training data. These methods include correlation methods and expression
ratio methods. While the classical methods eliminate genes that are useless for
discrimination (noise), they do not yield compact gene sets because genes are
redundant. Moreover, complementary genes that individually do not separate
well are missed.

A simple feature (gene) ranking can be produced by evaluating how well
an individual feature contributes to the separation (e.g. cancer vs. normal).
Various correlation coefficients have been used as ranking criteria. See, e.g.,
T.R. Golub, et al., "Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring", Science 286, 531-37 (1999). The
coefficient used by Golub et al. is defined as:

$$ w_i = \frac{\mu_i(+) - \mu_i(-)}{\sigma_i(+) + \sigma_i(-)} \qquad (2) $$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation, respectively, of the gene
expression values of a particular gene i for all the patients of class (+) or class (-),
i = 1, ... n. Large positive $w_i$ values indicate strong correlation with class (+)
whereas large negative $w_i$ values indicate strong correlation with class (-). The
method described by Golub, et al. for feature ranking is to select an equal number
of genes with positive and with negative correlation coefficient. Other methods
use the absolute value of $w_i$ as ranking criterion, or a related coefficient.

(3)

What characterizes feature ranking with correlation methods is the
implicit orthogonality assumptions that are made. Each coefficient $w_i$ is
computed with information about a single feature (gene) and does not take into
account mutual information between features.
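
A minimal sketch of the Golub et al. coefficient, computed one gene at a time, is given below. It assumes the coefficient is the difference of class means divided by the sum of class standard deviations, and it uses the population standard deviation (`statistics.pstdev`); the source does not say whether the population or sample form is intended. The function name and toy expression values are illustrative.

```python
from statistics import mean, pstdev


def golub_coefficients(patterns_pos, patterns_neg):
    """Per-gene coefficient (mu(+) - mu(-)) / (sigma(+) + sigma(-)).

    patterns_pos / patterns_neg: lists of patterns (one list of gene
    expression values per patient) for class (+) and class (-).
    A gene that is constant within both classes would divide by zero.
    """
    coefficients = []
    for col_pos, col_neg in zip(zip(*patterns_pos), zip(*patterns_neg)):
        numerator = mean(col_pos) - mean(col_neg)
        denominator = pstdev(col_pos) + pstdev(col_neg)
        coefficients.append(numerator / denominator)
    return coefficients


# Hypothetical toy data: two patients per class, two genes.
toy_pos = [[4.0, 1.0], [6.0, 3.0]]
toy_neg = [[1.0, 2.0], [3.0, 4.0]]
toy_w = golub_coefficients(toy_pos, toy_neg)
```

Each coefficient uses only one column of the data, which illustrates why such a ranking cannot account for interactions between genes.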

One use of feature ranking is in the design of a class predictor (or
classifier) based on a preselected subset of genes. Each feature which is
correlated (or anti-correlated) with the separation of interest is by itself such a
class predictor, albeit an imperfect one. A simple method of classification
comprises a method based on weighted voting: the features vote in proportion to
their correlation coefficient. Such is the method used by Golub, et al. The
weighted voting scheme yields a particular linear discriminant classifier:

$$ D(\mathbf{x}) = \mathbf{w} \cdot (\mathbf{x} - \boldsymbol{\mu}), \qquad (4) $$

where w is $w_i = (\mu_i(+) - \mu_i(-))/(\sigma_i(+) + \sigma_i(-))$ and $\boldsymbol{\mu} = (\boldsymbol{\mu}(+) + \boldsymbol{\mu}(-))/2$.
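
The weighted voting classifier can be sketched as below, assuming it takes the form D(x) = w · (x − μ), with w the per-gene correlation coefficients and μ the midpoint of the two class mean vectors. As before, the population standard deviation is an assumption, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def train_weighted_voting(patterns_pos, patterns_neg):
    """Return (w, mu): per-gene weights and the midpoint of the class means."""
    w, mu = [], []
    for col_pos, col_neg in zip(zip(*patterns_pos), zip(*patterns_neg)):
        m_pos, m_neg = mean(col_pos), mean(col_neg)
        w.append((m_pos - m_neg) / (pstdev(col_pos) + pstdev(col_neg)))
        mu.append((m_pos + m_neg) / 2)
    return w, mu


def vote(w, mu, x):
    """D(x) = w . (x - mu); positive means class (+), negative class (-)."""
    return sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mu))


# Hypothetical toy data: two patients per class, two genes.
vw, vmu = train_weighted_voting([[4.0, 1.0], [6.0, 3.0]],
                                [[1.0, 2.0], [3.0, 4.0]])
```

Each gene contributes a signed vote proportional to its weight and to how far the new pattern lies from the midpoint between the classes.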

Another classifier or class predictor is Fisher's linear discriminant. Such
a classifier is similar to that of Golub et al. where

$$ \mathbf{w} = S^{-1}\left(\boldsymbol{\mu}(+) - \boldsymbol{\mu}(-)\right), \qquad (5) $$

where S is the (n,n) within class scatter matrix defined as

$$ S = \sum_{\mathbf{x} \in X(+)} (\mathbf{x} - \boldsymbol{\mu}(+))(\mathbf{x} - \boldsymbol{\mu}(+))^T + \sum_{\mathbf{x} \in X(-)} (\mathbf{x} - \boldsymbol{\mu}(-))(\mathbf{x} - \boldsymbol{\mu}(-))^T \qquad (6) $$

where $\boldsymbol{\mu}$ is the mean vector over all training patterns and X(+) and X(-) are the
training sets of class (+) and (-), respectively. This form of Fisher's
discriminant implies that S is invertible, however, this is not the case if the
number of features n is larger than the number of examples $\ell$ since the rank of S
is at most $\ell$. The classifiers of Equations 4 and 6 are similar if the scatter matrix
is approximated by its diagonal elements. This approximation is exact when the
vectors formed by the values of one feature across all training patterns are
orthogonal, after subtracting the class mean. The approximation retains some
validity if the features are uncorrelated, that is, if the expected value of the
product of two different features is zero, after removing the class mean.
Approximating S by its diagonal elements is one way of regularizing it (making it
invertible). However, features usually are correlated and, therefore, the diagonal
approximation is not valid.

One aspect of the present invention comprises using the feature ranking
coefficients as classifier weights. Reciprocally, the weights multiplying the
inputs of a given classifier can be used as feature ranking coefficients. The inputs
that are weighted by the largest values have the most influence in the
classification decision. Therefore, if the classifier performs well, those inputs
with largest weights correspond to the most informative features, or in this
instance, genes. Other methods, known as multivariate classifiers, comprise
algorithms to train linear discriminant functions that provide superior feature
ranking compared to correlation coefficients. Multivariate classifiers, such as the
Fisher's linear discriminant (a combination of multiple univariate classifiers) and
methods disclosed herein, are optimized during training to handle multiple
variables or features simultaneously.

For classification problems, the ideal objective function is the expected
value of the error, i.e., the error rate computed on an infinite number of examples.
For training purposes, this ideal objective is replaced by a cost function J
computed on training examples only. Such a cost function is usually a bound or
an approximation of the ideal objective, selected for convenience and efficiency.
For linear SVMs, the cost function is:

$$ J = \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7) $$

which is minimized, under constraints, during training. The criterion $(w_i)^2$
estimates the effect on the objective (cost) function of removing feature i.

A good feature ranking criterion is not necessarily a good criterion for
ranking feature subsets. Some criteria estimate the effect on the objective
function of removing one feature at a time. These criteria become suboptimal
when several features are removed at one time, which is necessary to obtain a
small feature subset.

Recursive Feature Elimination (RFE) methods can be used to overcome
this problem. RFE methods comprise iteratively 1) training the classifier, 2)
computing the ranking criterion for all features, and 3) removing the feature
having the smallest ranking criterion. This iterative procedure is an example of
backward feature elimination. For computational reasons, it may be more
efficient to remove several features at a time at the expense of possible
classification performance degradation. In such a case, the method produces a
"feature subset ranking", as opposed to a "feature ranking". Feature subsets are
nested, e.g., $F_1 \subset F_2 \subset \cdots \subset F$.

If features are removed one at a time, this results in a corresponding
feature ranking. However, the features that are top ranked, i.e., eliminated last,
are not necessarily the ones that are individually most relevant. It may be the case
that the features of a subset $F_m$ are optimal in some sense only when taken in
some combination. RFE has no effect on correlation methods since the ranking
criterion is computed using information about a single feature.

A preferred method of the present invention is to use the weights of a
classifier to produce a feature ranking with a SVM (Support Vector Machine).
The present invention contemplates methods of SVMs used for both linear and
non-linear decision boundaries of arbitrary complexity, however, the example
provided herein is directed to linear SVMs because of the nature of the data set
under investigation. Linear SVMs are particular linear discriminant classifiers.
(See Equation 1). If the training set is linearly separable, a linear SVM is a
maximum margin classifier. The decision boundary (a straight line in the case of
a two-dimension separation) is positioned to leave the largest possible margin on
either side. One quality of SVMs is that the weights $w_i$ of the decision function
D(x) are a function only of a small subset of the training examples, i.e., "support
vectors". Support vectors are the examples that are closest to the decision
boundary and lie on the margin. The existence of such support vectors is at the
origin of the computational properties of SVM and its competitive classification
performance. While SVMs base their decision function on the support vectors
that are the borderline cases, other methods such as the previously-described
method of Golub, et al., base the decision function on the average case.

A preferred method of the present invention comprises using a variant of
the soft-margin algorithm where training comprises executing a quadratic
program as described by Cortes and Vapnik in "Support vector networks", 1995,
Machine Learning, 20:3, 273-297, which is herein incorporated in its entirety.
The following is provided as an example, however, different programs are
contemplated by the present invention and can be determined by those skilled in
the art for the particular data sets involved.

Inputs comprise training examples (vectors) $\{x_1, x_2, \ldots, x_k, \ldots, x_\ell\}$ and class
labels $\{y_1, y_2, \ldots, y_k, \ldots, y_\ell\}$. To identify the optimal hyperplane, the following
quadratic program is executed:

Minimize over $\alpha_k$:

$$ J = \frac{1}{2} \sum_{hk} y_h y_k \alpha_h \alpha_k \left(\mathbf{x}_h \cdot \mathbf{x}_k + \lambda \delta_{hk}\right) - \sum_k \alpha_k $$

$$ \text{subject to: } 0 \le \alpha_k \le C \text{ and } \sum_k \alpha_k y_k = 0 \qquad (8) $$

with the resulting outputs being the parameters $\alpha_k$, where the summations run
over all training patterns $x_k$ that are n dimensional feature vectors, $x_h \cdot x_k$ denotes
the scalar product, $y_k$ encodes the class label as a binary value +1 or -1, $\delta_{hk}$ is the
Kronecker symbol ($\delta_{hk} = 1$ if h = k and 0 otherwise), and $\lambda$ and C are positive
constants (soft margin parameters). The soft margin parameters ensure
convergence even when the problem is non-linearly separable or poorly
conditioned. In such cases, some support vectors may not lie on the margin.
Methods include relying on $\lambda$ or C, but preferred methods, and those used in the
Examples below, use a small value of $\lambda$ (on the order of $10^{-14}$) to ensure
numerical stability. For the Examples provided herein, the solution is rather
insensitive to the value of C because the training data sets are linearly separable
down to only a few features. A value of C = 100 is adequate, however, other
methods may use other values of C.

The resulting decision function of an input vector x is:

$$ D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b $$

with

$$ \mathbf{w} = \sum_k \alpha_k y_k \mathbf{x}_k \quad \text{and} \quad b = \langle y_k - \mathbf{w} \cdot \mathbf{x}_k \rangle \qquad (9) $$

The weight vector w is a linear combination of training patterns. Most weights $\alpha_k$
are zero. The training patterns with non-zero weights are support vectors. Those
having a weight that satisfies the strict inequality $0 < \alpha_k < C$ are marginal support
vectors. The bias value b is an average over marginal support vectors.
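
Given the multipliers produced by training, the weight vector and bias can be recovered as sketched below. The sketch assumes the standard form in which w is the sum of α_k y_k x_k and b is the average of y_k − w · x_k over marginal support vectors (those with 0 < α_k < C). The function name is hypothetical, and the multipliers in the toy example were worked out by hand for a one-dimensional separable problem rather than produced by a solver.

```python
def weights_and_bias(alphas, labels, patterns, C):
    """Recover (w, b) of a linear SVM from its multipliers alpha_k."""
    n = len(patterns[0])
    w = [0.0] * n
    for a, y, x in zip(alphas, labels, patterns):
        for i in range(n):
            w[i] += a * y * x[i]
    # Marginal support vectors satisfy the strict inequality 0 < alpha < C.
    marginal = [(y, x) for a, y, x in zip(alphas, labels, patterns)
                if 0 < a < C]
    b = sum(y - sum(wi * xi for wi, xi in zip(w, x))
            for y, x in marginal) / len(marginal)
    return w, b


# Hand-worked toy problem: patterns at 1, -1 and 3 on a line.
toy_w, toy_b = weights_and_bias(
    alphas=[0.5, 0.5, 0.0],
    labels=[1, -1, 1],
    patterns=[[1.0], [-1.0], [3.0]],
    C=100,
)
```

The third pattern has a zero multiplier, so it is not a support vector and does not influence w or b.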
3 The following sequence illustrates application of recursive feature 

elimination (RFE) to a SVM using the weight magnitude as the ranking criterion. 
The inputs arc training examples (vectors) : X 0 = \% u x 2 „...x i ,..x f j T and class 
!«hdsY = ry;.Y X ...,y t ...yjf. 
Mtaiize: 

10 Subset of surviving features 

Features ranked list 

r*\\ 
Repeat until Wf] 
15 Restrict training examples to good feature indices 

X-X 0 (-,s) 
Train the classifier 

Compute the weight vector of dimension Sengrhfs): 

20 

w 

Compute the ranking criteria 
c; a { W ;) J t for all i 
25 Find the feature with smallest ranking criterion 

t"~ arfiminic) 
Update feature ranked list 
r«{s(l),r] 

* t hef - vifh sma rankni ie 
30 s = s(i:f-l,f*l;ieRgth(s)} 

The output comprises feature ranked list t, 

lh 1 « d c 5 r ] ! 

algorithm to remove more than one feature per step. 
In general, RFE is computationally expensive when compared against
correlation methods, where several thousands of input data points can be ranked
in about one second using a Pentium® processor, and weights of the classifier
trained only once with all features, such as SVMs or pseudo-inverse/mean
squared error (MSE). A SVM implemented using non-optimized Matlab® code
on a Pentium® processor can provide a solution in a few seconds. To increase
computational speed, RFE is preferably implemented by training multiple
classifiers on subsets of features of decreasing size. Training time scales linearly
with the number of classifiers to be trained. The trade-off is computational time
versus accuracy. Use of RFE provides better feature selection than can be
obtained by using the weights of a single classifier. Better results are also
obtained by eliminating one feature at a time as opposed to eliminating chunks of
features. However, significant differences are seen only for a smaller subset of
features such as fewer than 100. Without trading accuracy for speed, RFE can be
used by removing chunks of features in the first few iterations and then, in later
iterations, removing one feature at a time once the feature set reaches a few
hundreds. RFE can be used when the number of features, e.g., genes, is increased
to millions. Furthermore, RFE consistently outperforms the naive ranking,
particularly for small feature subsets. (The naive ranking comprises ranking the
features with $(w_i)^2$, which is computationally equivalent to the first iteration of
RFE.) The naive ranking orders features according to their individual relevance,
while RFE ranking is a feature subset ranking. The nested feature subsets contain
complementary features that individually are not necessarily the most relevant.
An important aspect of SVM feature selection is that clean data is most preferred
because outliers play an essential role. The selection of useful patterns, support
vectors, and selection of useful features are connected.
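
One way to picture the strategy of removing chunks first and single features later is as a schedule of removal counts. The sketch below is an assumption-laden illustration: the chunk rule (halving the surviving set) and the switch-over threshold `one_at_a_time_below` are hypothetical choices, since the text only says that chunks are removed until the feature set reaches a few hundreds.

```python
def elimination_schedule(n_features, one_at_a_time_below=256):
    """Yield how many features to remove at each RFE iteration.

    Chunks (half of the surviving set, an assumed rule) are removed while
    more than `one_at_a_time_below` features survive; after that, features
    are removed one at a time.
    """
    remaining = n_features
    while remaining > 0:
        if remaining > one_at_a_time_below:
            step = remaining // 2
            # Never overshoot the point where single removals begin.
            step = min(step, remaining - one_at_a_time_below)
        else:
            step = 1
        yield step
        remaining -= step
```

For example, `list(elimination_schedule(2000, one_at_a_time_below=250))` starts with three chunk removals and continues with single-feature removals until no features remain.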

Pre-processing can have a strong impact on SVM-RFE. In particular,
feature scales must be comparable. One pre-processing method is to subtract the
mean of a feature from each feature, then divide the result by its standard
deviation. Such pre-processing is not necessary if scaling is taken into account in
the computational cost function.

In addition to the above-described linear case, SVM-RFE can be used in
nonlinear cases and other kernel methods. The method of eliminating features on
the basis of the smallest change in cost function, as described herein, can be
extended to nonlinear uses and to all kernel methods in general. Computations
can be made tractable by assuming no change in the value of the $\alpha$'s. Thus, the
classifier does not need to be retrained for every candidate feature to be
eliminated.

Specifically, in the case of SVMs, the cost function to be minimized
(under the constraints $0 \le \alpha_k \le C$ and $\sum_k \alpha_k y_k = 0$) is:

$$J = \frac{1}{2} \alpha^T H \alpha - \alpha^T \mathbf{1}, \qquad (10)$$

where H is the matrix with elements $y_h y_k K(x_h, x_k)$, K is a kernel function that
measures the similarity between $x_h$ and $x_k$, and $\mathbf{1}$ is an $\ell$-dimensional vector of
ones.

An example of such a kernel function is

$$K(x_h, x_k) = \exp(-\gamma \|x_h - x_k\|^2). \qquad (11)$$

To compute the change in cost function caused by removing input
component i, one leaves the $\alpha$'s unchanged and recomputes matrix H. This
corresponds to computing $K(x_h(-i), x_k(-i))$, yielding matrix $H(-i)$, where the
notation $(-i)$ means that component i has been removed. The resulting ranking
coefficient is:

$$DJ(i) = \frac{1}{2} \alpha^T H \alpha - \frac{1}{2} \alpha^T H(-i) \alpha. \qquad (12)$$

The input corresponding to the smallest difference DJ(i) shall be removed. The
procedure is repeated to carry out Recursive Feature Elimination (RFE).
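
A minimal sketch of the ranking coefficient DJ(i) for the Gaussian kernel of equation (11) follows. It assumes the $\alpha$'s have already been obtained from a trained SVM; the function names, the default `gamma`, and the plain-list data layout are illustrative choices rather than anything prescribed by the text.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2), as in equation (11)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def quad_form(X, y, alpha, gamma):
    """Return alpha^T H alpha with H[h][k] = y_h * y_k * K(x_h, x_k)."""
    total = 0.0
    for h in range(len(X)):
        for k in range(len(X)):
            total += (alpha[h] * alpha[k] * y[h] * y[k]
                      * rbf_kernel(X[h], X[k], gamma))
    return total


def ranking_coefficients(X, y, alpha, gamma=1.0):
    """Return DJ(i) for every input component i, keeping alpha fixed."""
    full = quad_form(X, y, alpha, gamma)
    dj = []
    for i in range(len(X[0])):
        # Remove component i from every training vector: x(-i).
        X_minus_i = [row[:i] + row[i + 1:] for row in X]
        dj.append(0.5 * full - 0.5 * quad_form(X_minus_i, y, alpha, gamma))
    return dj
```

The component with the smallest entry of the returned list is the candidate for removal. A component that takes the same value in every training vector leaves the kernel matrix unchanged, so its coefficient is zero.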

The present invention is directed to methods, systems and devices for
using state-of-the-art classifiers, such as a SVM disclosed herein that uses RFE, to
provide information to others through readily-accessed channels. A preferred
embodiment of the methods of providing such information comprises the
following steps. Data to be analyzed is provided. This data may come from
customers, research facilities, academic institutions, national laboratories,
commercial entities or other public or confidential sources. The source of the
data and the types of data provided are not crucial to the methods. The data may
be provided to the SVM through any means such as via the internet, server
linkages or discs, CDs, DVDs or other storage means.
The data is input into a computer system, preferably a SVM-RFE. The
SVM-RFE is run one or more times to generate the best feature selections,
which can be displayed in an observation graph. The SVM may use any
algorithm and the data may be preprocessed and postprocessed if needed.
Preferably, a server contains a first observation graph that organizes the results of
the SVM activity and selection of features.

The information generated by the SVM may be examined by outside
experts, computer databases, or other complementary information sources. For
example, if the resulting feature selection information is about selected genes,
biologists or experts or computer databases may provide complementary
information about the selected genes, for example, from medical and scientific
literature. Using all the data available, the genes are given objective or subjective
grades. Gene interactions may also be recorded.

The information derived from the SVM and the other information sources
are combined to yield a global combined graph. The graph provides information
such as multiple alternative candidate subsets of selected features, such as genes,
with scores attached to them. For example, in the gene selection data used
herein, the score reflects how predictive the genes are from a statistical point of
view and how interesting they are from a biological point of view.
The graph can be explored with a computer means, such as a browser.
The knowledge base may be built interactively while exploring the graph. The
results of the study, such as the best fitting genes, are returned to the data
provider, or to the final user, or are sent to another entity which desires the
information or are made available on the internet or via a dedicated on-line
service. Financial transactions may also occur at several steps. A final user is
one who receives the information determined by the methods herein.

A preferred selection browser is preferably a graphical user interface that
would assist final users in using the generated information. For example, in the
examples herein, the selection browser is a gene selection browser that assists the
final user in selection of potential drug targets from the genes identified by the
SVM-RFE. The inputs are the observation graph, which is an output of a
statistical analysis package, and any complementary knowledge base information,
preferably in a graph or ranked form. For example, such complementary
information for gene selection may include knowledge about the genes, functions,
derived proteins, measurement assays, isolation techniques, etc. The user interface
preferably allows for visual exploration of the graphs and the product of the two
graphs to identify promising targets. The browser does not generally require
intensive computations and, if needed, can be run on other computer means. The
graph generated by the server can be precomputed, prior to access by the browser,
or is generated in situ and functions by expanding the graph at points of interest.

In a preferred embodiment, the server is a statistical analysis package and,
in the gene feature selection, a gene selection server. For example, inputs are
patterns of gene expression, from sources such as DNA microarrays or other data
sources. Outputs are an observation graph that organizes the results of one or
more runs of SVM-RFE. It is optimum to have the selection server run the
computationally expensive operations.

A preferred method of the server is to expand the information acquired by
the SVM. The server can use any SVM results, and is not limited to SVM-RFE
selection methods. As an example, the method is directed to gene selection,
though any data can be treated by the server. Using SVM-RFE for gene selection,
gene redundancy is eliminated, but it is informative to know about discriminant
genes that are correlated with the genes selected. For a given number N of genes,
only one combination is retained by SVM-RFE. In actuality, there are many
combinations of N different genes that provide similar results.
A combinatorial search is a method allowing selection of many alternative
combinations of N genes, but this method is prone to overfitting the data. SVM-RFE
does not overfit the data. SVM-RFE is combined with supervised clustering
to provide lists of alternative genes that are correlated with the optimum selected
genes. Mere substitution of one gene by another correlated gene yields
substantial classification performance degradation.

An example of an observation graph containing several runs of SVM-RFE
for colon data is shown in FIG. 9. A path from the root node to a given node in
the tree at depth D defines a subset of D genes. The quality of every subset of
genes can be assessed, for example, by the success rate of a classifier trained with
these genes. The color of the last node of a given path indicates the quality of the
subset.

The graph has multiple uses. For example, in designing a therapeutic
composition that uses a maximum of four proteins, the statistical analysis does
not take into account which proteins are easier to provide to a patient. In the
graph, the preferred unconstrained path in the tree is indicated by the bold edges
in the tree, from the root node to the darkest leaf node. This path corresponds to
running a SVM-RFE. If it is found that the gene selected at this node is difficult
to use, a choice can be made to use the alternative protein, and follow the
remaining unconstrained path, indicated by bold edges. This decision process can
be optimized by using the notion of search discussed below in a product graph.

In FIG. 9, a binary tree of depth 4 is shown. This means that for every
gene selection, there are only two alternatives and selection is limited to four
genes. Wider trees allow for selection from a wider variety of genes. Deeper trees
allow for selection of a larger number of genes.
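
The relationship between tree paths and gene subsets can be illustrated with a small data structure. This is a sketch only: the `Node` class, the function name, and the gene labels used in any example are hypothetical, and the root is assumed to carry no gene so that a node at depth D corresponds to a subset of D genes.

```python
class Node:
    """A node of the observation tree; the root carries no gene."""

    def __init__(self, gene=None, children=()):
        self.gene = gene
        self.children = list(children)


def gene_subsets(node, path=()):
    """Yield the gene subset defined by each root-to-node path.

    Children are visited from oldest to youngest, so the first subsets
    produced follow the oldest branch of the tree.
    """
    for child in node.children:
        subset = path + (child.gene,)
        yield subset
        yield from gene_subsets(child, subset)
```

In a binary tree of depth 4, every subset produced has between one and four genes, and each node offers two alternative continuations.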
An example of construction of the tree of the observation graph is
presented herein and shown in FIG. 10. The steps of the construction of the tree
of FIG. 9 are shown in FIG. 10. In A, all of the oldest descendants of the root are
labeled by the genes obtained from regular SVM-RFE gene ranking. The best
ranking gene is closest to the root node. The other children of the root, from
older to younger, and all their oldest descendants are then labeled. In the case of a
binary tree, there are only two branches, or children, of any one node (B). The
top ranking gene of A is removed, and SVM-RFE is run again. This second level
of the tree is filled with the top ranking genes, from root to leaf. At this stage, all
the nodes that are at depth 1 are labeled with one gene. In moving to fill the
second level, the SVM is run using constrained RFE. The constraint is that the
gene of the oldest node must never be eliminated. The second child of the oldest
node of root and all its oldest descendants are labeled by running the constrained
RFE (C).

The examples included herein show preferred methods for determining
the genes that are most correlated to the presence of cancer or can be used to
predict cancer occurrence in an individual. The present invention comprises these
methods, and other methods, including other computational methods, usable in a
learning machine for determining genes, proteins or other measurable criteria for
the diagnosis or prognosis of changes in a biological system. There is no
limitation to the source of the data and the data can be combinations of
measurable criteria, such as genes, proteins or clinical tests, that are capable of
being used to differentiate between normal conditions and changes in conditions
in biological systems.

In the following examples, preferred numbers of genes were determined
that result from separation of the data that discriminate. These numbers are not
limiting to the methods of the present invention. Preferably, the preferred
optimum number of genes is a range of approximately from 1 to 500, more
preferably, the range is from 10 to 250, from 1 to 50, even more preferably the
range is from 1 to 32, still more preferably the range is from 3 to 21 and most
preferably, from 1 to 10. The preferred optimum number of genes can be affected
by the quality and quantity of the original data and thus can be determined for
each application by those skilled in the art.

Once the determinative genes are found by the learning machines of the
present invention, methods and compositions for treatments of the biological
changes in the organisms can be employed. For example, for the treatment of
colon cancer, therapeutic agents can be administered to antagonize or agonize,
enhance or inhibit activities, presence, or synthesis of the gene products.
Therapeutic agents and methods include, but are not limited to, gene therapies
such as sense or antisense polynucleotides, DNA or RNA analogs,
pharmaceutical agents, plasmaphoresis, antiangiogenics, and derivatives, analogs
and metabolic products of such agents.

Such agents are administered via parenteral or noninvasive routes. Many
active agents are administered through parenteral routes of administration:
intravenous, intramuscular, subcutaneous, intraperitoneal, intraspinal, intrathecal,
intracerebroventricular, intraarterial and other routes of injection. Noninvasive
routes for drug delivery include oral, nasal, pulmonary, rectal, buccal, vaginal,
transdermal and ocular routes.

Another embodiment of the present invention comprises use of testing
remote from the site of determination of the patterns through means such as the
internet or telephone lines. For example, a genomic test to identify the presence
of genes known to be related to a specific medical condition is performed in a
physician's office. Additionally, other information such as clinical data or
proteomic determinations may also be made at the same time or a different time.
The results of one, some or all of the tests are transmitted to a remote site that
houses the SVMs. Such testing could be used for the diagnosis stages, for
determining the prognosis of the disease, for determining the results of therapy
and for prescriptive applications such as determining which therapy regimen is
better for individual patients.

This invention is further illustrated by the following examples, which are
not to be construed in any way as imposing limitations upon the scope thereof.
On the contrary, it is to be clearly understood that resort may be had to various
other embodiments, modifications, and equivalents thereof which, after reading
the description herein, may suggest themselves to those skilled in the art without
departing from the spirit of the present invention and/or the scope of the
appended claims.

EXAMPLE 1 

Analysis of gene patterns related to colon cancer

Analysis of data from diagnostic genetic testing, microarray data
of gene expression vectors, was performed with a SVM-RFE. The original data
for this example was derived from the data presented in Alon et al., 1999. Gene
expression information was extracted from microarray data resulting, after
pre-processing, in a table of 62 tissues x 2000 genes. The 62 tissues include 22
normal tissues and 40 colon cancer tissues. The matrix contains the expression of
the 2000 genes with highest minimal intensity across the 62 tissues. Some of the
genes are non-human genes.

The data proved to be relatively easy to separate. After
preprocessing, it was possible to find a weighted sum of a set of only a few
genes that separated without error the entire data set, thus the data set was linearly
separable. One problem in the colon cancer data set was that tumor samples and
normal samples differed in cell composition. Tumor samples were normally rich
in epithelial cells whereas normal samples were a mixture of cell types, including
a large fraction of smooth muscle cells. While the samples could be easily
separated on the basis of cell composition, this separation was not very
informative for tracking cancer-related genes.

Alon et al. provides an analysis of the data based on top down
hierarchical clustering, a method of unsupervised learning. The analysis shows
that most normal samples cluster together and most cancer samples cluster
together. Alon et al. explain that "outlier" samples that are classified in the
wrong cluster differ in cell composition from typical samples. They compute a
muscle index that measures the average gene expression of a number of smooth
muscle genes. Most normal samples have high muscle index and cancer samples
low muscle index. The opposite is true for most outliers.

Alon et al. also cluster genes and show that some genes correlate
with a cancer vs. normal separation scheme but do not suggest a specific method
of gene selection.

The gene selection method of the present invention is
compared against a reference gene selection method described in Golub et al.,
Science, 1999, which is referred to as the "baseline method." Since there was no
defined training and test set, the data was randomly split into 31 samples for
training and 31 samples for testing.

In Golub et al., the authors use several metrics of classifier quality,
including error rate, rejection rate at fixed threshold, and classification
confidence. Each value is computed both on the independent test set and using
the leave-one-out method on the training set. The leave-one-out method consists
of removing one example from the training set, constructing the decision function
on the basis only of the remaining training data and then testing on the removed
example. In this method, one tests all examples of the training data and measures
the fraction of errors over the total number of training examples.
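
The leave-one-out procedure can be sketched generically. In the illustration below, the classifier is passed in as a pair of functions, and a nearest-class-mean classifier (`train_centroids`, `predict_centroids`) is used purely as a simple stand-in; it is an assumption for the sake of a runnable example and not the classifier of Golub et al. or of the present Example.

```python
def leave_one_out_error(X, y, train, predict):
    """Fraction of training examples misclassified when each is held out."""
    errors = 0
    for i in range(len(X)):
        # Remove example i, train on the rest, then test on example i.
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        model = train(X_rest, y_rest)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


def train_centroids(X, y):
    """Stand-in trainer (assumption): store the mean vector of each class."""
    means = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def predict_centroids(means, x):
    """Assign x to the class whose mean vector is closest."""
    return min(
        means,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, means[lab])),
    )
```

Any other pair of `train` and `predict` functions with the same calling convention can be substituted, for example a linear SVM.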

The methods of this Example of using the learning machine of the present
invention use a modification of the above metrics. FIG. 14 graphically illustrates
use of a linear discriminant classifier. A) Separation of the training examples with
an SVM. B) Separation of the training and test examples with the same SVM.
C) Separation of the training examples with the baseline method. D) Separation
of the training and test examples with the baseline method. The present
classification methods use various decision functions D(x) whose inputs are gene
expression coefficients and whose outputs are a signed number indicative of
whether or not cancer was present. The classification decision is carried out
according to the sign of D(x). The magnitude of D(x) is indicative of
classification confidence.

Four metrics of classifier quality were used (see FIG. 12):

Error (B1+B2) = number of errors ("bad") at zero rejection.

Reject (R1+R2) = minimum number of rejected samples to obtain zero error.

Extremal margin (E/D) = difference between the smallest output of the positive
class samples and the largest output of the negative class samples (rescaled by the
largest difference between outputs).

Median margin (M/D) = difference between the median output of the positive
class samples and the median output of the negative class samples (rescaled by
the largest difference between outputs).

Each value is computed both on the training set with the leave-one-out
method and on the test set.

The error rate is the fraction of examples that are misclassified
(corresponding to a diagnostic error). The error rate is complemented by the
success rate. The rejection rate is the fraction of examples that are rejected (on
which no decision is made because of low confidence). The rejection rate is
complemented by the acceptance rate. Extremal and median margins are
measurements of classification confidence. Note that the margin computed with
the leave-one-out method or on the test set differs from the margin computed on
training examples sometimes used in model selection criteria.
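
The four metrics can be computed directly from the classifier outputs D(x) and the class labels. The sketch below makes several assumptions that the text does not spell out: labels are +1 and -1, an output of exactly zero counts as an error, rejection is done by discarding samples whose |D(x)| is at or below a threshold, and the "largest difference between outputs" is taken as the maximum output minus the minimum output.

```python
from statistics import median


def quality_metrics(outputs, labels):
    """Error, reject, extremal margin and median margin from outputs D(x)."""
    pos = [d for d, t in zip(outputs, labels) if t > 0]
    neg = [d for d, t in zip(outputs, labels) if t < 0]
    # |D(x)| of every misclassified sample (wrong sign, or zero output).
    wrong = [abs(d) for d, t in zip(outputs, labels) if d * t <= 0]
    if wrong:
        # Smallest threshold that rejects every error.
        threshold = max(wrong)
        rejects = sum(1 for d in outputs if abs(d) <= threshold)
    else:
        rejects = 0
    spread = max(outputs) - min(outputs)  # largest difference between outputs
    return {
        "error": len(wrong),
        "reject": rejects,
        "extremal_margin": (min(pos) - max(neg)) / spread,
        "median_margin": (median(pos) - median(neg)) / spread,
    }
```

A negative extremal margin indicates that the two classes overlap in output space, while the median margin is less sensitive to a few outlying samples.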

A method for predicting the optimum subset of genes comprised defining
a criterion of optimality that uses information derived from training examples
only. This criterion was checked by determining whether the predicted gene
subset performed best on the test set.

A criterion that is often used in similar "model selection" problems is the
leave-one-out success rate $V_{suc}$. In the present example, it was of little use since
it does not allow differentiation between many classifiers that have zero
leave-one-out error. Such differentiation is obtained by using a criterion that
combines all of the quality metrics computed by cross-validation with the
leave-one-out method:

$$Q = V_{suc} + V_{acc} + V_{ext} + V_{med}$$

where $V_{suc}$ is the success rate, $V_{acc}$ the acceptance rate, $V_{ext}$ the extremal margin,
and $V_{med}$ is the median margin.

Theoretical considerations suggested modification of this criterion to penalize
large gene sets. Indeed, the probability of observing large differences between
the leave-one-out error and the test error increases with the size d of the gene set,
using the formula below:

$$\epsilon(d) = \sqrt{-\log(\alpha) + \log(G(d))} \cdot \sqrt{p(1-p)/n}$$

where $(1-\alpha)$ is the confidence (typically 95%, i.e., $\alpha = 0.05$);
p is the "true" error rate ($p \le 0.01$); and
n is the size of the training set.

Following the guaranteed risk principle (Vapnik, 1974), a quantity proportional to
$\epsilon(d)$ was subtracted from criterion Q to obtain a new criterion:

$$C = Q - 2\,\epsilon(d)$$

The coefficient of proportionality was computed heuristically, assuming
that $V_{suc}$, $V_{acc}$, $V_{ext}$ and $V_{med}$ are independent random variables with the same
error bar $\epsilon(d)$ and that this error bar is commensurate to a standard deviation. In
this case, variances would be additive, therefore, the error bar should be
multiplied by sqrt(4).
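
The penalized criterion can be evaluated with a few lines of arithmetic. In this sketch the value of G(d) is supplied by the caller as `g_d`, because this passage does not define G(d), and the logarithms are assumed to be natural logarithms; both points are assumptions rather than statements of the text.

```python
import math


def error_bar(g_d, n, alpha=0.05, p=0.01):
    """epsilon(d) = sqrt(-log(alpha) + log(G(d))) * sqrt(p * (1 - p) / n)."""
    return (math.sqrt(-math.log(alpha) + math.log(g_d))
            * math.sqrt(p * (1 - p) / n))


def selection_criterion(v_suc, v_acc, v_ext, v_med, g_d, n,
                        alpha=0.05, p=0.01):
    """C = Q - 2 * epsilon(d), with Q the sum of the four quality metrics."""
    q = v_suc + v_acc + v_ext + v_med
    return q - 2 * error_bar(g_d, n, alpha, p)
```

The factor 2 is sqrt(4), reflecting the heuristic that the four metrics contribute additive variances. With the other arguments fixed, a larger `g_d` gives a larger error bar and therefore a lower criterion value.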

A more detailed discussion of the methods of a preferred embodiment
follows. A SVM-RFE was run on the raw data to assess the validity of the
method. The colon cancer data samples were split randomly into 31 examples
for training and 31 examples for testing. The RFE method was run to
progressively downsize the number of genes, each time dividing the number by 2.
The preprocessing of the data for each gene expression value consisted of
subtracting the mean from the value, then dividing the result by the standard
deviation.
The leave-one-out method with the classifier quality criterion was used to
estimate the optimum number of genes. The leave-one-out method comprises
taking out one example of the training set. Training is then performed on the
remaining examples, with the left out example being used to test the trained
classifier. This procedure is iterated over all the examples. Every criterion is
computed as an average over all examples. The overall classifier quality criterion
is the sum of 4 values: the leave-one-out success rate (at zero rejections), the
leave-one-out acceptance rate (at zero error), the leave-one-out extremal margin,
and the leave-one-out median margin. The classifier is a SVM with
hard margin.

Results of the SVM-RFE as taught herein show that at the optimum
predicted by the method using training data only, the leave-one-out error is zero
and the test performance is actually optimum. Four genes are discovered and
they are:
L07648 Human MXI1 mRNA, complete cds.
T47377 71035 S-100P PROTEIN (HUMAN).
M76378 Human cysteine-rich protein (CRP) gene, exons 5 and 6.
Z50753 H.sapiens mRNA for GCAP-II/uroguanylin precursor.

The optimum test performance had an 81% success rate. This result was
consistent with the results reported in the original paper by Alon et al. Moreover,
the errors, except for one, were identified by Alon et al. as outliers. The errors
were 8, 36, 34, 12, -36, and -30, with 36 being the error not identified by Alon et
al. as an outlier. The number identifies the tissue while the sign indicates
presence or absence of tumor (negative = tumor, positive or no sign = normal). No
direct performance comparison was made because Alon et al. are using
unsupervised learning on the entire data set whereas this embodiment used
supervised learning on half of the data set. The plot of the performance curves as
a function of gene number is shown in Figure 12. The description of the graph of
Figure 12 is as follows:

Horizontal axis = log2(number of genes).
Vertical axis = success rate.
Curves: circle = test success rate;
        square = leave-one-out quality criterion;
        triangle = epsilon (theoretical error bar);
        diamond = square - triangle (smoothed) predictor
        of optimum test success rate.
The optimum of the diamond curve is at log2(num genes) = 2 => num genes = 4,
which coincides with the optimum of the circle curve.

Preprocessing Steps
Taking the log

The initial preprocessing steps of the data were described by Alon et al.
The data was further pre-processed in order to reduce the skew in the data
distribution. Figure 13 shows the distributions of gene expression values across
tissue samples for two random genes (cumulative number of samples of a given
expression value) which is compared with a uniform distribution. Each line
represents a gene. FIGS. 13A and 13B show the raw data; FIGS. 13C and 13D
are the same data after taking the log. By taking the log of the gene expression
values the same curves result and the distribution is more uniform. This may be
due to the fact that gene expression coefficients are often obtained by computing
the ratio of two values. For instance, in a competitive hybridization scheme, DNA
from two samples that are labeled differently are hybridized onto the array. One
obtains at every point of the array two coefficients corresponding to the
fluorescence of the two labels and reflecting the fraction of DNA of either sample
that hybridized to the particular gene. Typically, the first initial preprocessing
step that is taken is to take the ratio a/b of these two values. Though this initial
preprocessing step is adequate, it may not be optimal when the two values are
small. Other initial preprocessing steps include (a-b)/(a+b) and (log a - log
b)/(log a + log b).
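
The alternative initial preprocessing steps are simple functions of the two fluorescence coefficients a and b. The sketch below writes them out; the function names are illustrative, natural logarithms are assumed, and no guard is included for zero or negative inputs, which real measurements may require.

```python
import math


def ratio(a, b):
    """The usual first step: the ratio a/b of the two coefficients."""
    return a / b


def log_ratio(a, b):
    """Logarithm of the ratio, which reduces the skew of the distribution."""
    return math.log(a / b)


def normalized_difference(a, b):
    """(a - b) / (a + b)."""
    return (a - b) / (a + b)


def normalized_log_difference(a, b):
    """(log a - log b) / (log a + log b)."""
    return (math.log(a) - math.log(b)) / (math.log(a) + math.log(b))
```

The two difference forms are antisymmetric in a and b (swapping the labels flips the sign), and `normalized_difference` is bounded between -1 and 1 for positive inputs.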

Subtracting the array mean

Figure 14 shows the distribution of gene expression values across genes
for all tissue samples. FIG. 14A shows the raw data and FIG. 14B shows the inv
erf. The shape is roughly that of an erf function, indicating that the density
follows approximately the Normal law. Indeed, passing the data through the
inverse erf function yields almost straight parallel lines. Thus, it is reasonable to
normalize the data by subtracting the mean. This preprocessing step is supported
by the fact that there are variations in experimental conditions from microarray to
microarray. Although standard deviation seems to remain fairly constant, the
other preprocessing step selected was to divide the gene expression values by the
standard deviation to obtain centered data of standardized variance.

Normalizing each gene expression across the tissue samples

Using training data only, the mean expression value and standard
deviation for each gene was computed. For all the tissue sample values of that
gene (training and test), that mean was then subtracted and the resultant value
was divided by the standard deviation. In some experiments, an additional
preprocessing step was added by passing the data through a squashing function to
diminish the importance of the outliers.
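
The per-gene normalization can be sketched as a fit step on the training samples followed by a transform step applied to training and test samples alike. Two points in the sketch are assumptions: the standard deviation is the population form (dividing by the number of samples), and the squashing function is taken to be a scaled arctangent with a hypothetical parameter `c`, since the text does not name a specific function.

```python
import math


def fit_gene_stats(train_rows):
    """Per-gene (mean, standard deviation) computed from training data only."""
    stats = []
    for col in zip(*train_rows):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        stats.append((mean, math.sqrt(var)))
    return stats


def standardize(rows, stats):
    """Subtract the training mean and divide by the training deviation."""
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def squash(rows, c=1.0):
    """Assumed squashing function c * atan(v / c) to damp outliers."""
    return [[c * math.atan(v / c) for v in row] for row in rows]
```

Rows are tissue samples and columns are genes. A gene with zero variance in the training set would cause a division by zero and would need to be dropped or handled separately.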
New RFE results

The data was preprocessed as described above to produce new and
improved results. The code was optimized such that RFE can be run by
eliminating one gene at a time. The gene selection cross-validation process used a
regular SVM.

The results of Figure 15 show a significant improvement over those of
Figure 12. FIG. 15 shows the results of RFE after preprocessing. The description
for FIG. 15 is as follows: Horizontal axis = log2(number of genes). Curves: circle
= test success rate; square = leave-one-out quality criterion; triangle = epsilon
(theoretical error bar); diamond = square - triangle (smoothed) predictor of
optimum test success rate. The optimum of the diamond curve is at log2(num
genes) = 4 => num genes = 16. Reduced capacity SVM used in FIG. 12 is
replaced by plain SVM. Although a log scale is still used for gene number, RFE
was run by eliminating one gene at a time. The best test performance is 90%
classification accuracy (8 genes). The optimum number of genes predicted from
the classifier quality based on training data information only is 16. This
corresponds to 87% classification accuracy on the test set. The same test
performance is also achieved with only 2 genes as follows:

J02854: Myosin regulatory light chain 2, smooth muscle isoform (human);
contains element TAR1 repetitive element.

R55310: S36390 Mitochondrial processing peptidase.

Neither of these two genes appears at the top of the list in the first
experiment.

The top gene found is a smooth muscle gene, which is a gene
characteristic of tissue composition and is probably not related to cancer.
Comparison with Golub's method

Golub's gene selection method is a ranking method where genes are
ordered according to the correlation between vectors of gene expression values
for all training data samples and the vector of target values (+1 for normal sample
and -1 for cancer sample). Golub et al. select m/2 top ranked and m/2 bottom
ranked genes to obtain one half of genes highly correlated with the separation and
one half anti-correlated. Golub et al. use a linear classifier. To classify an
unknown sample, each gene "votes" for cancer or normal according to its
correlation coefficient with the target separation vector. The top gene selected by
Golub's method was J02854 (smooth muscle related). FIG. 16 illustrates the
comparison of this embodiment's use of the baseline method with Golub et al.
The same curves as were used in FIG. 15 are shown in FIG. 16. The description
for Figure 16 is as follows: Horizontal axis = log2(number of genes). Curves:
circle = test success rate; square = leave-one-out quality criterion; triangle =
epsilon (theoretical error bar); diamond = square - triangle (smoothed) predictor
of optimum test success rate. The data, pre-processed identically in FIGS. 15 and
16, was then treated by Golub's method and graphed in Figure 19. It is the novel
finding of the present inventors to select an optimum number of genes to use with
learning machines such as SVMs.
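
The correlation-based ranking described for the baseline method can be illustrated as follows. This sketch assumes a plain Pearson correlation between each gene's expression vector and the target vector, which may differ from the exact coefficient used by Golub et al.; the function names are illustrative, and the voting classifier itself is not reproduced here.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def select_genes(X, y, m):
    """Pick the m/2 most correlated and m/2 most anti-correlated genes.

    X holds one row per training sample and one column per gene;
    y holds the targets (+1 for normal, -1 for cancer).
    """
    scores = [pearson(col, y) for col in zip(*X)]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    half = m // 2
    top = order[:half]                    # highly correlated genes
    bottom = order[len(order) - half:]    # anti-correlated genes
    return top, bottom, scores
```

Unlike RFE, each gene is scored on its own, so two highly correlated genes receive similar ranks even though they carry redundant information. A gene with constant expression has zero variance and would need to be excluded before calling `pearson`.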

To compare the results of the methods of this embodiment of the present
invention and Golub, a statistical test was used that determines with what
confidence $(1-\eta)$ one classifier is better than the other, using the formula:

$$(1 - \eta) = 0.5 + 0.5\,\mathrm{erf}(z_\eta / \sqrt{2})$$
$$z_\eta = \epsilon n / \sqrt{\nu}$$

where n is the number of test examples, $\nu$ is the total number of errors (or
rejections) that only one of the two classifiers makes, $\epsilon$ is the difference in
error rate (or in rejection rate), and erf is the error function
$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,dt$.
This assumes i.i.d. (independent and identically distributed) errors, one-sided risk
and the approximation of the Binomial law by the Normal law.

This formula was applied to the results summarized in Table 6. In either
case, $\epsilon = 3/31$ and $\nu = 3$. The total number of test examples is n = 31. On the
basis of this test, the methods of this embodiment of the present invention were
better than Golub with 95.8% confidence.
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| Method 


] Optimum error rale 


j Error rase ai She optimum | 
' i ' v. f genes 


j -SVM.RFE | 


9.68 \ 


j 12.90 


| Golub ; 


19.35 


1 



Tabled Etxor rates com xisons betw then > t j jodiment of the 
present invention and Goiub's method. The- sign indicates cancer (negative) or 
normal (positive). For this ' embodiment of she present invention, the best 
5 performance, was at 8 genes and the optimum predicted at 16 genes. For Golub, 
the best performance was at 56 genes .and the optimum predicted at 4 genes. 
Note that there was, only one error difference between the best performance and 
die optimum predicted in either case. 

Combining clustering and gene selection

Because of data redundancy, it was possible to find many subsets of genes
that provide a reasonable separation. To analyze the results, it was optimal to
understand how these genes are related. While not wishing to be bound by any
particular theory, it was the initial theory that the problem of gene selection
was to find an optimum number of genes, preferably small, that separates normal
tissue from cancer tissues with maximum accuracy.

SVM-RFE used a subset of genes that were complementary and thus
carried little redundant information. No other information on the structure and
nature of the data was provided. Because data were very redundant, a gene that
had not been selected may nevertheless be informative for the separation.

Correlation methods such as Golub's method provide a ranked list of
genes. The rank order characterizes how correlated the gene is with the
separation. Generally, a gene highly ranked taken alone provides a better
separation than a lower ranked gene. It is therefore possible to set a threshold
(e.g. keep only the top ranked genes) that separates "highly informative genes"
from "less informative genes".

The methods of the present invention such as SVM-RFE provide subsets
of genes that are both smaller and more discriminant. The gene selection method
using SVM-RFE also provides a ranked list of genes. With this list, nested
subsets of genes of increasing sizes can be defined. However, the fact that one
gene has a higher rank than another gene does not mean that this one factor alone
characterizes the better separation. In fact, genes that are eliminated very
early may be very informative but redundant with others that were kept. The 32
best genes as a whole provide a good separation but individually may not be very
correlated with the target separation. Gene ranking allows for building nested
subsets of genes that provide good separations. It is not informative for how
good an individual gene may be. Genes of any rank may be correlated with the
32 best genes. The correlated genes may be ruled out at some point because of
their redundancy with some of the remaining genes, not because they did not
carry information relative to the target separation.

The gene ranking alone is insufficient to characterize which genes are
informative and which ones are not, and also to determine which genes are
complementary and which ones redundant.

To overcome the problems of gene ranking alone, the data was
pre-processed with an unsupervised clustering method. Genes were grouped
according to resemblances (according to a given metric). Cluster centers were
then used instead of genes themselves and processed by SVM-RFE to produce
nested subsets of cluster centers. An optimum subset size can be chosen with the
same cross-validation method used before.

Using the data, the QT_clust clustering algorithm was used to produce 100
dense clusters. The similarity measure used was Pearson's correlation coefficient
(as commonly used for gene clustering). FIG. 17 provides the performance
curves of the results of RFE when training on 100 dense QT_clust clusters.
Horizontal axis = log2(number of gene cluster centers). Curves: circle = test
success rate; square = leave-one-out quality criterion; triangle = epsilon
(theoretical error bar); diamond = square - triangle (smoothed) predictor of
optimum test success rate. The optimum of the diamond curve is at
log2(number of gene cluster centers) = 3 => number of gene cluster centers = 8.

The results of this analysis are comparable to those of FIG. 15. The
cluster elements are listed in Table 7.

Table 7: QT_clust clusters selected with RFE. The higher the cluster rank (Rk),
the more important the cluster should be. Min correl is the minimum correlation
coefficient between cluster elements. GAN = Gene Accession Number.

With unsupervised clustering, a set of informative genes is defined, but
there is no guarantee that the genes not retained do not carry information. When
RFE was used on all QT_clust clusters plus the remaining non-clustered genes
(singleton clusters), the performance curves were quite similar, though the top
set of gene clusters selected was completely different and included mostly
singletons. The genes selected in Table 7 are organized in a structure: within a
cluster, genes are redundant; across clusters they are complementary.

The cluster centers can be substituted by any of their members. This factor
may be important in the design of some medical diagnosis tests. For example, the
administration of some proteins may be easier than that of others. Having a
choice of alternative genes introduces flexibility in the treatment and
administration choices.

Ten random choices were tested, in that one gene of each of the 8 clusters
was selected randomly. The average test set accuracy was 0.80 with a standard
deviation of 0.05. This is to be compared with 0.87 for the cluster centers. One
of the random choice tests yielded an accuracy that was superior to that of the
centers (0.90): D23672, T51023, T85247, R89377, R51749, X55187, R39209,
U09564.

Hierarchical clustering instead of QT_clust clustering was used to produce
lots of small clusters containing 2 elements on average. Because of the smaller
cluster cardinality, there were fewer gene alternatives from which to choose. In
this instance, hierarchical clustering did not yield as good a result as using
QT_clust clustering. The present invention contemplates use of any of the known
methods for clustering, including but not limited to hierarchical clustering,
QT_clust clustering and SVM clustering. The choice of which clustering method to
employ in the invention is affected by the initial data and the outcome desired,
and can be determined by those skilled in the art.

Supervised clustering 

Another method used with the present invention was to use clustering as a
post-processing step of SVM-RFE. Each gene selected by running regular SVM-RFE
on the original set of gene expression coefficients was used as a cluster
center. For example, the results described with reference to FIG. 15 were used.
For each of the top eight genes, the correlation coefficient was computed with
all remaining genes. The parameters were that the genes clustered to gene i were
the genes that met the following two conditions: must have higher correlation
coefficient with gene i than with other genes in the selected subset of eight
genes, and must have correlation coefficient exceeding a threshold θ.
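The two assignment conditions can be sketched as follows. This is an illustrative sketch only, not the inventors' implementation: the function names, the dictionary-based data layout, the toy expression profiles and the use of the signed Pearson coefficient are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def cluster_around_centers(expr, centers, theta):
    """Assign each remaining gene to the selected gene it correlates with best.

    expr:    dict mapping gene name -> expression profile (hypothetical layout)
    centers: genes selected beforehand (e.g. by SVM-RFE), used as cluster centers
    theta:   correlation threshold a gene must exceed to join a cluster
    """
    clusters = {c: [] for c in centers}
    for gene, profile in expr.items():
        if gene in clusters:
            continue
        corrs = {c: pearson(profile, expr[c]) for c in centers}
        best = max(corrs, key=corrs.get)  # condition 1: best-correlated center
        if corrs[best] > theta:           # condition 2: exceeds the threshold
            clusters[best].append(gene)
    return clusters


# Toy data, invented for illustration only.
expr = {
    "c1": [1, 2, 3, 4],
    "c2": [4, 3, 2, 1],
    "g1": [2, 4, 6, 9],   # follows c1 closely
    "g2": [9, 6, 4, 2],   # follows c2 closely
    "g3": [1, 3, 2, 3],   # correlates with c1 only weakly (about 0.67)
}
clusters = cluster_around_centers(expr, ["c1", "c2"], theta=0.75)
print(clusters)
```

Genes that fail the threshold test, such as `g3` here, stay unassigned, which is why the number of members per cluster cannot be controlled directly with this scheme.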

In the Figures and Tables presented herein the results for 8 genes were
presented. The clustered genes are listed in Table 8.

Table 8: The higher the cluster rank (Rk), the more "relevant" the cluster
should be. Min correl is the minimum correlation coefficient between cluster
elements. GAN = Gene Accession Number. The cluster centers are preceded by a
star.

Compared to the unsupervised clustering method and results, the
supervised clustering method, in this instance, does not provide better control
over the number of examples per cluster. Therefore, this method is not as good
as unsupervised clustering if the goal is the ability to select from a variety of
genes in each cluster. However, supervised clustering may show specific clusters
that have relevance for the specific knowledge being determined. In this
particular embodiment, a very large cluster of genes was found that
contained several muscle genes that may be related to tissue composition and may
not be relevant to the cancer vs. normal separation. Thus, those genes are good
candidates for elimination from consideration as having no bearing on the
diagnosis or prognosis for colon cancer.

Factoring out tissue composition related genes

The following method was directed to eliminating the identified tissue
composition-related genes automatically. Genes of this type complicate the
analysis of the results because it was not possible to differentiate them from
genes that are informative for the cancer vs. normal separation. The results
with the unsupervised learning preprocessing showed that the top ranked genes
did not contain the key words "smooth muscle" that were used to detect potential
tissue composition related genes. A cardiac muscle gene was still selected under
this method.

Using the training set/test set split that was described earlier, other
methods were used. For example, some of the top ranked genes were eliminated
and the gene selection process was run again until there were no more "smooth
muscle" genes or other muscle genes in the top ranked genes. However, the
performance on the test set deteriorated and there was no automatic criterion that
would allow the determination of when the gene set was free of tissue
composition related genes.

In a preferred method of the present invention, the gene selection process
was performed on the entire data set. With a larger number of training samples,
the learning machine, such as the SVM used here, factored out tissue composition
related genes. While not wishing to be bound by any particular theory, it is
theorized that the SVM property of focusing on borderline cases (support vectors)
may take advantage of a few examples of cancer tissue rich in muscle cells and of
normal tissues rich in epithelial cells (the inverse of the average trend).

The resulting top ranking genes were free of muscle related genes,
including the genes that were clustered with supervised clustering. In contrast,
Golub's method obtains 3 smooth muscle related genes in the 7 top ranking gene
clusters alone. Further, the top ranking genes found by SVM-RFE were all
characterizing the separation, cancer vs. normal (Table 9). The present invention
is able to not only make a quantitative difference on this data set with better
classification accuracy and smaller gene subset, but it also makes a qualitative
difference in that the gene set is free of tissue composition related genes.



Rk | Sgn | GAN | Description | Possible function/relation to colon cancer
1 | [illegible] | H08393 | COLLAGEN ALPHA 2(XI) CHAIN (Homo sapiens) | Collagen is involved in cell adhesion. Colon carcinoma cells have collagen degrading activity as part of the metastatic process.
2 | [illegible] | M59040 | Human cell adhesion molecule (CD44) mRNA, complete cds. | CD44 is upregulated when colon adenocarcinoma tumor cells transit to the metastatic state.
3 | [illegible] | T94570 | Human chitotriosidase precursor mRNA, complete cds. | Another chitinase (BRP39) was found to play a role in breast cancer. Cancer cells overproduce this chitinase to survive apoptosis.
4 | [illegible] | H81558 | PROCYCLIC FORM SPECIFIC POLYPEPTIDE B1-ALPHA PRECURSOR (Trypanosoma brucei brucei) | It was shown that patients infected by Trypanosoma (a colon parasite) develop resistance against colon cancer.
5 | [illegible] | [illegible] | ATP SYNTHASE COUPLING FACTOR 6, MITOCHONDRIAL PRECURSOR (HUMAN) | ATP synthase is an enzyme that helps build blood vessels that feed the tumors.
6 | [illegible] | T62947 | 60S RIBOSOMAL PROTEIN L24 (Arabidopsis thaliana) | May play a role in controlling cell growth and proliferation through the selective translation of particular classes of mRNA.
7 | [illegible] | H64807 | PLACENTAL FOLATE TRANSPORTER (Homo sapiens) | Diminished status of folate has been associated with enhanced risk of colon cancer.

Table 9: The 7 top ranked genes discovered by the methods of the present
invention, in order of increasing importance. Rk: rank. Sgn: sign of correlation
with the target separation, - for over-expressed in most cancer tissues; + for
over-expressed in most normal tissues; GAN: Gene Accession Number. The possible
function is derived from a keyword search involving "colon cancer" or "cancer"
and some words in the gene description.

FIG. 18 plots the results of the methods of the present invention using
SVM-RFE after training on the whole data set. In FIG. 18, the curves of the plot
are identified as follows: Horizontal axis = log2(number of gene cluster centers).
Vertical axis = success rate. Curves: solid circle = training success rate; dashed
black = leave-one-out success rate; square = leave-one-out quality criterion;
triangle = epsilon (theoretical error bar); diamond = square - triangle (smoothed)
predictor of optimum test success rate. The optimum of the diamond curve is at
log2(num genes) = 5 => num genes = 32.

For comparison, FIG. 19 plots the results obtained with Golub's method
when training on the entire data set. The curves of this plot are identified as
follows: Horizontal axis = log2(number of gene cluster centers). Curves: circle =
training success rate; dashed black = leave-one-out success rate; square =
leave-one-out quality criterion; triangle = epsilon (theoretical error bar);
diamond = square - triangle (smoothed) predictor of optimum test success rate.
The optimum of the diamond curve is at log2(num genes) = 2 => num genes = 4.

The best leave-one-out performance is 100% accuracy for SVMs and only
90% for Golub's method (6 errors). The methods of the present invention
provide better results than can be obtained using Golub's method with a 99.3%
confidence rate.

The optimum number of genes predicted by the leave-one-out criterion is
32 genes (per FIG. 18). In Table 10, the "muscle index" values of these 7
support vectors are provided. The muscle index is a quantity computed by Alon
et al. on all samples that reflects the muscle cell contents of a sample. Most
normal samples have a higher muscle index than do tumor samples. However,
the support vectors do not show any such trend. There is a mix of normal and
cancer samples with either high or low muscle index.

More importantly, an analysis of the genes discovered reveals that the first
smooth muscle gene ranks 5 for Golub's method and only 41 for SVMs.
Furthermore, the optimum number of genes using SVM prediction is 32 genes on
a log plot and 21 genes on a linear plot. Therefore, SVMs are able to avoid
relying on tissue composition-related genes to perform the separation. As
confirmed by biological data, the top ranking genes discovered by SVMs are all
related to cancer vs. normal separation. In contrast, Golub's method selects genes
that are related to tissue composition and not to the distinction of cancer vs.
normal in its top ranking genes.

Table 10: Muscle index of the support vectors of the SVM trained on the top 7
genes selected by SVM-RFE. Samples with a negative sign are tumor tissues.
Samples with positive signs are normal tissues. Samples were ranked in order
of increasing muscle index. In most samples in the data set, normal tissues have
higher muscle index than tumor tissues because tumor tissues are richer in
epithelial (skin) cells. This was not the case for support vectors, which show a
mix of all possibilities.


Table 11 provides the seven top ranked genes discovered by the SVM-RFE
of the present invention and the genes that clustered to them at threshold
θ = 0.75. The same information is provided for Golub's method in Table 12.

Table 11: SVM top ranked clusters when using all 62 tissues. Clusters are
built around the best genes with threshold θ = 0.75. The higher the cluster rank
(Rk), the more "relevant" the cluster should be. Min correl is the minimum
correlation coefficient between cluster elements. Sgn: sign of correlation with
the target separation, - for over-expressed in most cancer tissues; + for
over-expressed in most normal tissues; GAN: Gene Accession Number. The cluster
centers are preceded by a star. None of the genes seem to be tissue composition
related.

Table 12: Golub top ranked clusters when using all 62 tissues. Clusters are
built around the best genes with threshold θ = 0.75. The higher the cluster rank
(Rk), the more "relevant" the cluster should be. Min correl is the minimum
correlation coefficient between cluster elements. Sgn: sign of correlation with
the target separation, - for over-expressed in most cancer tissues; + for
over-expressed in most normal tissues; GAN: Gene Accession Number. The cluster
centers are preceded by a star. Some of the genes may be tissue composition
related.
As a feature selection method, SVM-RFE differed from Golub's method
in two respects. First, the mutual information between features was used by
SVMs while Golub's method makes implicit independence assumptions.
Second, the decision function was based only on support vectors that are
"borderline" cases as opposed to being based on all examples in an attempt to
characterize the "typical" cases. The use of support vectors is critical in factoring
out irrelevant tissue composition-related genes. SVM-RFE was compared with
RFE methods using other linear discriminant functions that do not make
independence assumptions but attempt to characterize the "typical" cases. Two
discriminant functions were chosen:

Fisher linear discriminant, also called Linear Discriminant Analysis (LDA)
(see e.g. Duda, 1973), because Golub's method approximates Fisher's linear
discriminant by making independence assumptions; and

Mean-Squared-Error (MSE) linear discriminant computed by pseudo-inverse
(see e.g. Duda, 1973), because when all training examples are support
vectors, the pseudo-inverse solution is identical to the SVM solution.

The results of comparison of feature (gene) selection methods for colon
cancer data are provided in FIG. 20. FIG. 20 shows the selection of an optimum
number of genes for colon cancer data. The number of genes selected by
Recursive Feature Elimination (RFE) was varied and was tested with different
methods. Training was done on the entire data set of 62 samples. The curves
represent the leave-one-out success rate. The different methods are shown in
FIG. 20, with the curves being identified as follows: Circle: SVM-RFE. Square:
Linear Discriminant Analysis-RFE. Diamond: Mean Squared Error
(Pseudo-inverse)-RFE. Triangle: Baseline method (Golub, 1999). SVM-RFE provides
the best results down to 4 genes. An examination of the genes selected reveals
that SVM eliminates genes that are tissue composition-related and keeps
genes that are relevant to the cancer vs. normal separation. Conversely, other
methods retain smooth muscle genes in their top ranked genes, which aids in
separating most samples but is not relevant to the cancer vs. normal
discrimination.

All methods that do not make independence assumptions outperform
Golub's method and reach 100% leave-one-out accuracy for at least one value of
the number of genes. LDA may be at a slight disadvantage on these plots
because, for computational reasons, RFE was used by eliminating chunks of
genes that decrease in size by powers of two. Other methods use RFE by
eliminating one gene at a time.

Down to 4 genes, SVM-RFE showed better performance than all the other
methods. All methods predicted, with the criterion of the equation
C = Q - 2ε(d), an optimum number of genes smaller or equal to 64. The genes
ranking 1 through 64 for all the methods studied were compared. The first gene
that was related to tissue composition and mentions "smooth muscle" in its
description ranks 5 for Golub's method, 4 for LDA, 1 for MSE and only 41 for SVM.
Therefore, this was a strong indication that SVMs make a better use of the data
compared with other methods since they are the only methods that effectively
factor out tissue composition-related genes while providing highly accurate
separations with a small subset of genes.

FIG. 18 is a plot of an optimum number of genes for evaluation of colon
cancer data. The number of genes selected by recursive gene elimination with
SVMs was varied. The curves are identified as follows: Circle: error rate on the
test set. Square: scaled quality criterion (Q/4). Crosses: scaled criterion of
optimality (C/4). Diamond curve: result of locally smoothing the C/4. Triangle:
scaled theoretical error bar (ε/2). The curves are related by C = Q - 2ε.

The model selection criterion was established using leukemia data; its
predictive power was correlated by using it on colon cancer data, without making
any adjustment. The criterion also predicted the optimum accurately. The
performance was not as accurate on the first trial because the same preprocessing
as for the leukemia data of Example 2 was used. The results were improved
substantially by adding several preprocessing steps and reached a success rate of
90% accuracy. These preprocessing steps included taking the logarithm of all
values, normalizing sample vectors, normalizing feature vectors, and passing the
result through a squashing function to diminish the importance of outliers,
f(x) = c arctan(x/c). Normalization comprised subtracting the mean over all
training values and dividing by the corresponding standard deviation.
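The preprocessing chain can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' code: the natural logarithm, the population standard deviation, the value c = 1.0 and the function names are all assumptions, since the text does not specify them. The input matrix is invented and holds samples in rows and genes in columns.

```python
import math
from statistics import mean, pstdev


def standardize(vec):
    """Subtract the mean and divide by the standard deviation (0 if constant)."""
    m, s = mean(vec), pstdev(vec)
    return [(v - m) / s if s > 0 else 0.0 for v in vec]


def preprocess(matrix, c=1.0):
    """Log-transform, normalize rows, normalize columns, then squash.

    matrix: list of samples, each a list of strictly positive expression values
    c:      scale of the squashing function f(x) = c * arctan(x / c) (assumed)
    """
    logged = [[math.log(v) for v in row] for row in matrix]      # logarithm
    rows = [standardize(r) for r in logged]                      # sample vectors
    cols = [standardize(list(col)) for col in zip(*rows)]        # feature vectors
    squashed = [[c * math.atan(v / c) for v in col] for col in cols]
    return [list(r) for r in zip(*squashed)]                     # back to rows


# Invented toy matrix: 3 samples by 3 genes.
data = [[1.0, 10.0, 100.0],
        [2.0, 5.0, 50.0],
        [4.0, 8.0, 16.0]]
out = preprocess(data)
for row in out:
    print([round(v, 3) for v in row])
```

The squashing step bounds every value to the open interval (-cπ/2, cπ/2), which is how it limits the influence of outliers while leaving small values nearly unchanged.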

The model selection criterion was used in a variety of other experiments
using SVMs and other algorithms. The optimum number of genes was always
predicted accurately, within a factor of two of the number of genes.

The results of the SVM-RFE analysis are confirmed in the biology
literature. The best ranked genes code for proteins whose role in colon cancer has
been long identified and widely studied. Such is the case of CD44, which is
upregulated when colon adenocarcinoma tumor cells transit to the metastatic state
(Ghina, 1998), and collagen, which is involved in cell adhesion. Colon carcinoma
cells have collagen degrading activity as part of the metastatic process
(Karakiulakis, 1997). ATP synthase as an enzyme that helps build blood vessels
to feed the tumors was published only a year ago (Mozer, 1999). Diminished
status of folate has been associated with enhanced risk of colon cancer in a recent
clinical study (Walsh, 1999). To this date, no known biochemical mechanism
explains the role of folate in colon cancer. Knowing that gene H64807 (Placental
folate transporter) was identified as one of the most discriminant genes in the
colon cancer vs. normal separation shows the use of the methods of the present
invention for identifying genes involved in biological changes.

In the case of human chitotriosidase, one needs to proceed by analogy
with another homologous protein of the same family whose role in another cancer
is under study: another chitinase (BRP39) was found to play a role in breast
cancer. Cancer cells overproduce this chitinase to survive apoptosis (Aronson,
1999). Important increased chitotriosidase activity has been noticed in clinical
studies of Gaucher's disease patients, an apparently unrelated condition. To
diagnose the presence of that disease, the chitotriosidase enzyme can be very
sensitively measured. The plasma or serum prepared from less than a droplet of
blood is highly sufficient for the chitotriosidase measurement (Aerts, 1996).

The 60S ribosomal protein L24 (Arabidopsis thaliana) is a non-human
protein that is homologous to a human protein located on chromosome 6. Like
other ribosomal proteins, it may play a role in controlling cell growth and
proliferation through the selective translation of particular classes of mRNA.

A surprisingly novel finding is the identified gene for "procyclic form
specific polypeptide B1-alpha precursor (Trypanosoma brucei brucei)".
Trypanosoma is a parasitic protozoa indigenous to Africa and South America and
patients infected by Trypanosoma (a colon parasite) develop resistance against
colon cancer (Oliveira, 1999). Trypanosomiasis is an ancient disease of humans
and animals and is still endemic in Africa and South America.


EXAMPLE 2 

Leukemia gene discovery 

The data set, which consisted of a matrix of gene expression vectors
obtained from DNA microarrays, was obtained from cancer patients with two
different types of leukemia. After preprocessing, it was possible to find a
weighted sum of a set of only a few genes that separated without error the entire
data set, thus the data set was linearly separable. Although the separation of the
data was easy, the problems present several features of difficulty, including small
sample sizes and data differently distributed between training and test set.

In Golub, 1999, the authors present methods for analyzing gene
expression data obtained from DNA micro-arrays in order to classify types of
cancer. The problem with the leukemia data was the distinction between two
variants of leukemia (ALL and AML). The data is split into two subsets: a
training set, used to select genes and adjust the weights of the classifiers, and an
independent test set used to estimate the performance of the system obtained.
Golub's training set consisted of 38 samples (27 ALL and 11 AML) from bone
marrow specimens. Their test set has 34 samples (20 ALL and 14 AML),
prepared under different experimental conditions and including 24 bone marrow
and 10 blood sample specimens. All samples have 7129 attributes (or features)
corresponding to some normalized gene expression value extracted from the
micro-array image. In this Example, the exact same experimental conditions
were retained to ease comparison with their method.

Simple preprocessing steps were performed. From each gene expression value,
the mean was subtracted and the result was divided by its standard deviation.
The RFE method was used and chunks of genes were eliminated at a time. At the
first iteration, a number of genes was reached that was the closest power of 2.
At subsequent iterations, half of the remaining genes were eliminated. Nested
subsets of genes were obtained that had increasing information density. The
quality of these subsets of genes was then assessed by training various
classifiers, including a linear SVM, the Golub et al. classifier and
Fisher's linear discriminant.

In preliminary experiments, some of the large deviations between
leave-one-out error and test error could not be explained by the small sample
size alone. The data analysis revealed that there are significant differences
between the distribution of the training set and the test set. Various hypotheses
were tested and it was found that the differences can be traced to differences in
data source. In all the experiments, the performance on test data from the
various sources was followed separately. The results obtained were the same,
regardless of the source.

In Golub, the authors use several metrics of classifier quality, including
error rate, rejection rate at fixed threshold, and classification confidence. Each
value is computed both on the independent test set and using the leave-one-out
method on the training set. The leave-one-out method consists of removing one
example from the training set, constructing the decision function on the basis
only of the remaining training data and then testing on the removed example. In
this method, one tests all examples of the training data and measures the fraction
of errors over the total number of training examples. See FIG. 22, which shows
the metrics of classifier quality. The curves (square and triangle) represent
example distributions of two classes: class 1 (negative class) and class 2
(positive class).

Square: Number of examples of class 1 whose decision function value is
larger than or equal to θ.

Triangle: Number of examples of class 2 whose decision function value is
smaller than or equal to θ. The number of errors B1 and B2 are the ordinates of
θ = 0. The number of rejected examples R1 and R2 are the ordinates of -θ_R and
θ_R in the triangle and circle curves respectively. The decision function value of
the rejected examples is smaller than θ_R in absolute value, which corresponds to
examples of low classification confidence. The threshold θ_R is set such that all
the remaining "accepted" examples are well classified. The extremal margin E is
the difference between the smallest decision function value of class 2 examples
and the largest decision function value of class 1 examples. On the example of
the figure, E is negative. If the number of classification errors is zero, E is
positive. The median margin M is the difference between the median decision
function value of the class 1 density and the median of the class 2 density.
function value of the class 1 density and the median of she class 2 density. 

In general, several cross tests were performed with the baseline method to
compare gene sets and classifiers: SVMs trained on SVM-selected genes or on
baseline genes, and baseline classifier trained on SVM-selected genes or on
baseline genes. Baseline classifier refers to the classifier of Equation 4, herein
(Golub, 1999). Baseline genes refer to genes selected according to the ranking
criterion of Equation 4 (w_i), herein.

First, the full set of 7129 genes (Table 13) was used. The measured values
were as described earlier.

Table 13: Results of training classifiers on all genes (Leukemia data).

A set of 50 genes was then selected. The 50 genes corresponded to the
largest weights of the SVM trained on all genes. A new SVM was trained on
these 50 genes. The results were compared with the baseline system trained with
the original set of 50 features reported in the Golub et al. paper. See Table 14.

Table 14: Results of training on 50 genes (Leukemia data).

In both cases, SVMs matched the performance of the baseline system or
outperformed it. Using the detailed results of Tables 15 and 16, the statistical
significance of the performance differences was checked with the following
equation:

    $(1 - \eta) = 0.5 + 0.5\,\mathrm{erf}(z_\eta / \sqrt{2})$
    $z_\eta = \epsilon n / \sqrt{\nu}$

Table 15: Detailed results of training on all genes (Leukemia data).

Table 16: Detailed results of training on 50 genes (Leukemia data).

According to the results of the test, the classifiers trained on 50 genes are
better than those trained on all genes with high confidence (based on the error
rate, 91.7% confidence for Golub and 98.7% for SVM). Based on the error rate
alone, the SVM classifier is not significantly better than the Golub classifier (50%
confidence on all genes and 84.1% confidence on 50 genes). But, based on the
rejections, the SVM classifier is significantly better than the Golub classifier
(99.9% confidence on all genes and 98.7% confidence on 50 genes).

A more in-depth comparison between the method of Golub et al. and
SVMs on the leukemia data was made. In particular, two aspects of the problem
were de-coupled: selecting a good subset of genes and finding a good decision
function. The performance improvements obtained with SVMs can be traced to
the SVM feature (gene) selection method. The particular decision function that
was trained with these features mattered less than selecting an appropriate subset
of genes.

Rather than ranking the genes once with the weights of an SVM classifier
according to the naive ranking discussed above, the Recursive Feature
Elimination (RFE) method was used. At each iteration, a new classifier is trained
with the remaining features. The feature corresponding to the smallest weight in
the new classifier is eliminated. The order of elimination yields a particular
ranking. By convention, the last feature to be eliminated is ranked first. Chunks
of genes were eliminated at a time. At the first iteration, the number of genes
which is the closest power of 2 was reached. At subsequent iterations, half of the
remaining genes were eliminated. Nested subsets of genes of increasing
informative density were obtained.
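
The elimination schedule described above can be sketched as follows. This is an illustrative sketch only: the trainer passed in is a hypothetical stand-in (a difference of class means), not an SVM, and "closest power of 2" is taken here to mean the largest power of two below the current number of genes, which is an assumption.

```python
def mean_difference_weights(X, y):
    """Stand-in linear trainer (NOT an SVM): difference of class means."""
    pos = [r for r, t in zip(X, y) if t > 0]
    neg = [r for r, t in zip(X, y) if t < 0]
    return [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(len(X[0]))]

def largest_power_of_two_at_most(n):
    p = 1
    while p * 2 <= n:
        p *= 2
    return p

def rfe_ranking(X, y, train):
    """Return feature indices ranked best first (last eliminated = rank 1)."""
    remaining = list(range(len(X[0])))
    eliminated = []                      # chunks, in order of elimination
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train(X_sub, y)              # retrain on the remaining features
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        n = len(remaining)
        target = largest_power_of_two_at_most(n)
        keep = target if target < n else n // 2   # first cut, then halving
        dropped = [remaining[k] for k in order[:n - keep]]
        eliminated.append(dropped)
        remaining = [remaining[k] for k in sorted(order[n - keep:])]
    # Reverse both the chunk order and the order inside each chunk.
    return [f for chunk in reversed(eliminated) for f in reversed(chunk)]

X = [[3.0, 0.1, 1.0, 0.0, 0.5], [2.8, 0.0, 1.2, 0.0, 0.4],
     [-3.0, 0.1, -1.0, 0.0, 0.45], [-3.1, 0.0, -0.8, 0.0, 0.5]]
y = [1, 1, -1, -1]
ranking = rfe_ranking(X, y, mean_difference_weights)
print(ranking[:2])
```

The nested subsets are then the prefixes of the returned ranking whose lengths are powers of two.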

The quality of these subsets of genes was then assessed by training
various classifiers, including a regular SVM, the Golub et al. classifier, and
Fisher's linear discriminant (see e.g. (Duda, 1973)). An SVM trained after
projecting the data along the first principal component of the training examples
was also used. This amounts to setting a simple bias value, which was placed at
the center of gravity of the two extreme examples of either class, weighted by the
number of examples per class. This classifier was called a "reduced-capacity-SVM"
("RC-SVM").

The various classifiers that were tried did not yield significantly different
performance. The results of the classifier of Golub, 1999 and the
reduced-capacity-SVM were reported herein. Several cross tests were performed with the
baseline method to compare gene sets and classifiers. See FIG. 23A, which shows
SVMs trained on SVM selected genes or on baseline genes, and FIG. 23B, which
shows a baseline classifier trained on SVM selected genes or on baseline genes.
Classifiers have been trained with subsets of genes selected with SVMs and with
the baseline method on the training set of the Leukemia data. The number of
genes is color coded and indicated in the legend. The quality indicators are plotted
radially: channels 1-4 = cross-validation results with the leave-one-out method;
channels 5-8 = test set results; suc = success rate; acc = acceptance rate; ext =
extremal margin; med = median margin. The coefficients have been rescaled
such that the average value of each indicator has zero mean and variance 1 across
all four plots. For each classifier, the larger the colored area, the better the
classifier. The figure shows that there is no significant difference between
classifier performance on this data set, but there is a significant difference
between the gene selections.

In Table 17, the best results obtained on the test set for each combination
of gene selection and classification method are summarized. The classifiers give
identical results, given a gene selection method. In contrast, the SVM selected
genes yield consistently better performance than the baseline genes for both
classifiers. The significance of the difference was tested with the statistical
equation used herein.

Whether SVM or baseline classifier, SVM genes were better with 84.1%
confidence based on test error rate and 99.2% based on the test rejection rate.


Table 17: Best classifier on test data (Leukemia data). The performance of the
classifiers performing best on test data is reported. For each combination of
SVM or Baseline genes and SVM or Baseline classifier, the corresponding
number of genes, the number of errors and the number of rejections are shown in
the table. The patient id numbers are shown in brackets.

To compare the top ranked genes, the fraction of common genes in the
SVM selected subsets and the baseline subsets (Table 18) was computed. At the
optimum number of 16 genes or less, at most 25% of the genes were common.

Table 18: Fraction of common genes between the sets selected with the baseline
method and SVM recursive gene elimination (Leukemia data). The fraction of
common genes decreases approximately exponentially as a function of the
number of genes (linearly in a log scale). Only 19% of the genes were common at
the optimum SVM gene set number 16.
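
The overlap measure reported in Table 18 can be sketched as follows. This is a hedged illustration: the function name is hypothetical, the gene identifiers are made up, and the fraction is assumed to be the number of shared genes divided by the subset size.

```python
def fraction_common(ranking_a, ranking_b, k):
    """Fraction of genes shared by the top-k subsets of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# Hypothetical rankings sharing 3 genes among their top 16.
svm_rank = list(range(16))
baseline_rank = [0, 1, 2] + list(range(100, 113))
print(fraction_common(svm_rank, baseline_rank, 16))  # 3 / 16
```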

FIG. 24 shows the best set of 16 genes for the leukemia data. In matrices
(a) and (c), the columns represent different genes and the lines (rows) different
patients from the training set. The 27 top lines are ALL patients and the 11
bottom lines are AML patients. The gray shading indicates gene expression: the
lighter the stronger. FIG. 24A shows SVM best 16 genes. Genes are ranked from
left to right, the best one at the extreme left. All the genes selected are more
AML correlated. FIG. 24B shows the weighted sum of the 16 SVM genes used
to make the classification decision. A very clear ALL/AML separation is shown.
FIG. 24C shows baseline method 16 genes. The method imposes that half of the
genes are AML correlated and half are ALL correlated. The best genes are in the
middle. FIG. 24D shows the weighted sum of the 16 baseline genes used to make
the classification decision. The separation is still good, but not as good as the
SVM separation.

FIGS. 24A and 24C show the expression values for the patients in the
training set of the 16 gene subsets. At first sight, the genes selected by the
baseline method looked a lot more orderly because they were strongly
correlated with either AML or ALL. There was a lot of redundancy in this gene
set. In essence, all the genes carried the same information. Conversely, the SVM
selected genes carried complementary information. This was reflected in the
output of the decision function (FIGS. 24B and 24D) which was a weighted sum
of the 16 gene expression values. The SVM output more clearly separated AML
patients from ALL patients. Tables 19 and 20 list the genes that were selected by
the two methods.

Table 19: Top ranked 16 SVM genes (Leukemia data). Rk=rank. GAN=Gene
Accession Number. Correlation=gene correlates most with the class listed. The
genes were obtained by recursively eliminating the least promising genes. Nested
subsets of genes are obtained.

Table 20: Top ranked 16 baseline genes (Leukemia data). GAN=Gene Accession
Number. Correlation=gene correlates most with the class listed. The 8 genes on
the left correlate most with ALL and the 8 genes on the right with AML. The top
ones are the best candidates. Golub et al. mixed equal proportions of
ALL-correlated and AML-correlated genes in their experiments.

AN OPTIMUM SUBSET OF GENES CAN BE PREDICTED

The problem of predicting an optimum subset of genes was addressed.
The criterion defined in the equation below, derived from training examples only,
was used.

C = Q - 2 ε(d)

Whether the predicted gene subset performed best on the test set was
checked. The tests were carried out using SVM-RFE. The number of features
was reduced progressively by a factor of two at every iteration. An SVM
classifier was trained on all the intermediate subsets found.

As shown in FIG. 25, an optimum number of 16 genes was found. The
number of genes selected by recursive gene elimination with SVMs was varied.
The description of the lines of the graph is as follows: Circle: error rate on the
test set. Square: scaled quality criterion (Q/4). Crosses: scaled criterion of
optimality (C/4). Diamond curve: result of locally smoothing the C/4. Triangle:
scaled theoretical error bar (ε/2). The curves are related by C = Q - 2ε. The dashed
line indicates the optimum of the diamond curve, which is the theoretically
predicted optimum, based on training data only: 2^4 = 16 genes. Zero test error is
obtained at this optimum.

The performance on the test set was also optimum at that value. The
details of the results are reported in Table 21.

Table 21: SVM classifier trained on SVM genes obtained with the RFE method
(Leukemia data). The criterion of classifier selection C was the classifier quality
Q minus the error bar ε. These quantities were computed based on training data
only. The success rate (at zero rejection), the acceptance rate (at zero error), the
extreme margin and the median margin were reported for the leave-one-out
method on the 38 sample training set (V results) and the 34 sample test set (T
results). Where the number of genes was 16 was the best classifier predicted by
the locally smoothed C criterion calculated using training data only.

At the optimum, the SVM is 100% accurate on the test set, without any rejection.
Comparison results with the baseline system at the predicted optimum are
shown in Table 22.

Table 22: Best classifier selected with criterion C (Leukemia data). The
performance of the classifiers corresponding to the optimum of criterion C,
computed solely on the basis of training examples, was reported. For each
combination of SVM or Baseline genes and SVM or Baseline classifier, the
corresponding number of genes, the number of errors and the number of
rejections are shown in the table.

The overall difference obtained between the SVM system (optimum SVM
classifier trained on SVM features) and the baseline system (optimum baseline
classifier trained on baseline features) was quite significant: 95.8% for the error
rate and 99.2% for the rejection rate. From cross-test analysis, it was seen that
these differences can be traced mostly to a better set of features rather than a
better classifier.

SVM RFE was also run on the entire data set of 72 samples. The four top
ranked genes are shown in Table 23.

Table 23: Top ranked genes (Leukemia data). The entire data set of 72 samples
was used to select genes with SVM RFE. Genes were ranked in order of
increasing importance. The first ranked gene is the last gene left after all other
genes have been eliminated. Expression: ALL>AML indicates that the gene
expression level is higher in most ALL samples; AML>ALL indicates that the
gene expression level is higher in most AML samples; GAN: Gene Accession
Number. All the genes in this list have some plausible relevance to the AML vs.
ALL separation.

The number of four genes corresponds to the minimum number of support
vectors (5 in this case). All four genes have some relevance to leukemia cancer
and can be used for discriminating between AML and ALL variants.

In this last experiment, the smallest number of genes that separate the
whole data set without error is two. For this set of genes, there is also zero
leave-one-out error. In contrast, Golub's method always yields at least one training
error and one leave-one-out error. One training error can be achieved with a
minimum of 16 genes and one leave-one-out error with a minimum of 64 genes.

EXAMPLE 3

Isolation of genes involved with Prostate Cancer

Using the methods disclosed herein, genes associated with prostate cancer
were isolated. Various methods of treating and analyzing the cells, including
SVM, were utilized to determine the most reliable method for analysis.

Tissues were obtained from patients that had cancer and underwent
prostatectomy. They were processed according to a standard protocol of
Affymetrix, and gene expression values from 7129 probes on the Affymetrix
Gene Chip were recorded for 67 tissues from 26 patients.

The samples collected included tissues from the Peripheral Zone (PZ),
Central Zone (CZ) and Transition Zone (TZ). Each sample potentially consisted
of four different cell types: Stromal cells (from the supporting tissue of the
prostate, not participating in its function); Normal organ cells; Benign prostatic
hyperplasia cells (BPH); Dysplasia cells (cancer precursor stage) and Cancer cells
(of various grades indicating the stage of the cancer). The distribution of the
samples in Table 24 reflects the difficulty of getting certain types of tissues.

Table 24: Distribution of samples.

It has been argued in the medical literature that TZ BPH could serve as a
good reference for PZ cancer. The highest grade cancer (G4) is the most
malignant. Part of these experiments are therefore directed towards the
separation of BPH vs. G4.

Some of the cells were prepared using laser confocal microscopy (LCM),
which was used to eliminate as much of the supporting stromal cells as possible
and provides purer samples.

Gene expression was assessed from the presence of mRNA in the cells.
The mRNA is converted into cDNA and amplified to obtain a sufficient quantity.
Depending on the amount of mRNA that can be extracted from the sample, one
or two amplifications may be necessary. The amplification process may distort
the gene expression pattern. In the data set under study, either 1 or 2
amplifications were used. LCM data always required 2 amplifications. The
treatment of the samples is detailed in Table 25.

Table 25

The end result of data extraction is a vector of 7129 gene expression
coefficients.

Probe pairs

Gene expression measurements require calibration. A probe cell (a square
on the array) contains many replicates of the same oligonucleotide (probe) that is
a 25 bases long sequence of DNA. Each "perfect match" (PM) probe is designed
to complement a reference sequence (piece of gene). It is associated with a
"mismatch" (MM) probe that is identical except for a single base difference in the
central position. The chip may contain replicates of the same PM probe at
different positions and several MM probes for the same PM probe, corresponding
to the substitution of one of the four bases. This ensemble of probes is referred to
as a probe set. The gene expression is calculated as:

Average Difference = (1 / pair num) Σ_{probe set} (PM - MM)

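
The Average Difference calculation can be sketched as follows. This is a minimal illustration with made-up intensities; the function name is hypothetical, and it assumes one MM value per PM value within a probe set.

```python
def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one probe set."""
    if not pm or len(pm) != len(mm):
        raise ValueError("need one MM intensity per PM intensity")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Three hypothetical probe pairs: differences 60, 70 and 60.
print(average_difference([100.0, 120.0, 90.0], [40.0, 50.0, 30.0]))
```
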
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Data quality 

If the magnitude of the probe pair values is not contrasted enough, the
probe pair is considered dubious. Thresholds are set to accept or reject probe
pairs. Affymetrix considers samples with 40% or over acceptable probe pairs of
good quality. Lower quality samples can also be effectively used with the SVM
techniques.

Preprocessing 

A simple "whitening" was performed as preprocessing. This means that
after preprocessing the data matrix resembles "white noise". In the original data
matrix a line of the matrix represented the expression values of 7129 genes for a
given sample (corresponding to a particular combination of
patient/tissue/preparation method). A column of the matrix represented the
expression values of a given gene across the 67 samples. Without normalization,
neither the lines nor the columns can be compared. There are obvious offset and
scaling problems. The samples were preprocessed to: normalize matrix columns;
normalize matrix lines; and normalize columns again. Normalization consists of
subtracting the mean and dividing by the standard deviation. A further
normalization step was taken when the samples are split into a training set and a
test set.

The mean and variance column-wise were computed for the training
samples only. All samples (training and test samples) were then normalized by
subtracting that mean and dividing by the standard deviation.
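
The two normalization steps described above can be sketched as follows. This is a hedged illustration on a toy matrix: the function names are hypothetical, the population standard deviation is assumed (the text does not say which estimator was used), and a zero standard deviation is replaced by 1 to avoid division by zero.

```python
from statistics import mean, pstdev

def _standardize(values):
    mu, sd = mean(values), pstdev(values) or 1.0   # guard: constant vector
    return [(v - mu) / sd for v in values]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize_rows(M):
    return [_standardize(row) for row in M]

def normalize_columns(M):
    return transpose(normalize_rows(transpose(M)))

def whiten(M):
    """Normalize columns, then lines (rows), then columns again."""
    return normalize_columns(normalize_rows(normalize_columns(M)))

def apply_training_stats(train, test):
    """Column statistics come from the training samples only."""
    stats = [(mean(c), pstdev(c) or 1.0) for c in transpose(train)]
    def scale(M):
        return [[(v - mu) / sd for v, (mu, sd) in zip(row, stats)] for row in M]
    return scale(train), scale(test)

M = [[1.0, 2.0, 3.0], [4.0, 6.0, 5.0], [7.0, 5.0, 9.0]]
W = whiten(M)
train_n, test_n = apply_training_stats([[1.0, 10.0], [3.0, 30.0]], [[2.0, 20.0]])
print(test_n)
```

Computing the statistics on the training samples alone keeps the test samples out of the learning process.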



The samples were evaluated to determine whether LCM data preparation
yields more informative data than unfiltered tissue samples and whether arrays of
lower quality contain useful information when processed using the SVM
technique.

Two data sets were prepared, one for a given data preparation method
(subset 1) and one for a reference method (subset 2). For example, method 1 =
LCM and method 2 = unfiltered samples. Golub's linear classifiers were then
trained to distinguish between cancer and normal cases using subset 1 and
another classifier using subset 2. The classifiers were then tested on the subset on
which they had not been trained (classifier 1 with subset 2 and classifier 2 with
subset 1).

If classifier 1 performs better on subset 2 than classifier 2 on subset 1, it
means that subset 1 contains more information to do the separation cancer vs.
normal than subset 2.

The input to the classifier is a vector of n "features" that are gene
expression coefficients coming from one microarray experiment. The two classes
are identified with the symbols (+) and (-), with "normal" or reference samples
belonging to class (+) and cancer tissues to class (-). A training set of a number of
patterns {x_1, x_2, ... x_k, ... x_p} with known class labels {y_1, y_2, ... y_k, ... y_p},
y_k ∈ {-1, +1}, is given. The training samples are used to build a decision function
(or discriminant function) D(x), that is a scalar function of an input pattern x.
New samples are classified according to the sign of the decision function:

D(x) > 0 ⇒ x ∈ class (+)
D(x) < 0 ⇒ x ∈ class (-)
D(x) = 0, decision boundary.

Decision functions that are simple weighted sums of the training patterns plus a
bias are called linear discriminant functions.

D(x) = w.x + b,

where w is the weight vector and b is a bias value.

In the case of Golub's classifier, each weight is computed as:

w_i = (μ_i(+) - μ_i(-)) / (σ_i(+) + σ_i(-))

where μ_i and σ_i are the mean and standard deviation of the gene expression
values of gene i for all the patients of class (+) or class (-), i=1, ... n. Large
positive w_i values indicate strong correlation with class (+) whereas large
negative w_i values indicate strong correlation with class (-). Thus the weights can
also be used to rank the features (genes) according to relevance. The bias is
computed as b = -w.μ, where μ = (μ(+) + μ(-))/2.

Golub's classifier is a standard reference that is robust against outliers. Once a
first classifier is trained, the magnitude of w_i is used to rank the genes. The
classifiers are then retrained with subsets of genes of different sizes, including the
best ranking genes.
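
The weight and bias formulas above can be sketched as follows. This is an illustrative sketch on toy data, not the implementation used in the experiments: the function names are hypothetical, the population standard deviation is assumed, and a zero denominator is replaced by 1.

```python
from statistics import mean, pstdev

def train_golub(X, y):
    """Return (w, b) with w_i = (mu_i(+) - mu_i(-)) / (sigma_i(+) + sigma_i(-))."""
    pos = [row for row, label in zip(X, y) if label > 0]
    neg = [row for row, label in zip(X, y) if label < 0]
    w, mid = [], []
    for i in range(len(X[0])):
        p = [row[i] for row in pos]
        n = [row[i] for row in neg]
        spread = (pstdev(p) + pstdev(n)) or 1.0   # assumption: guard zero spread
        w.append((mean(p) - mean(n)) / spread)
        mid.append((mean(p) + mean(n)) / 2)       # mu = (mu(+) + mu(-)) / 2
    b = -sum(wi * mi for wi, mi in zip(w, mid))   # b = -w.mu
    return w, b

def decision(w, b, x):
    """D(x) = w.x + b; the sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def rank_genes(w):
    """Gene indices sorted by decreasing |w_i|."""
    return sorted(range(len(w)), key=lambda i: -abs(w[i]))

X = [[5.0, 1.0], [6.0, 2.0], [1.0, 1.5], [2.0, 1.2]]
y = [1, 1, -1, -1]
w, b = train_golub(X, y)
print(rank_genes(w), decision(w, b, [5.5, 1.5]) > 0)
```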

To assess the statistical significance of the results, ten random splits of the
data including samples from either preparation method were prepared and
submitted to the same method. This allowed the computation of an average and
standard deviation for comparison purposes.

Importance of LCM data preparation

Tissue from the same patient was processed either directly (unfiltered) or
after the LCM procedure, yielding a pair of micro-array experiments. This
yielded 13 pairs, including: four G4; one G3+4; two G3; four BPH; one CZ
(normal) and one PZ (normal).

For each data preparation method (LCM or unfiltered tissues), the tissues were
grouped into two subsets:

Cancer = G4+G3 (7 cases)
Normal = BPH+CZ+PZ (6 cases).

The results are shown in FIG. 26. The large error bars are due to the
small size. However, there is an indication that LCM samples are better than
unfiltered tissue samples. It is also interesting to note that the average curve
corresponding to random splits of the data is above both curves. This is not
surprising since the data in subset 1 and subset 2 are differently distributed. When
making a random split rather than segregating samples, both LCM and unfiltered
tissues are represented in the training and the test set, and performance on the test
set is better on average.

Importance of array quality as measured by Affymetrix

The same methods were applied to determine whether microarrays with
gene expressions rejected by the Affymetrix quality criterion contained useful
information by focusing on the problem of separating BPH tissue vs. G4 tissue
with a total of 42 arrays (18 BPH and 24 G4).

The Affymetrix criterion identified 17 good quality arrays, 8 BPH and 9
G4. Two subsets were formed:

Subset 1 = "good" samples, 8 BPH + 9 G4
Subset 2 = "mediocre" samples, 10 BPH + 15 G4

For comparison, all of the samples were lumped together and 10 random
subsets 1 containing 8 BPH + 9 G4 of any quality were selected. The remaining
samples were used as subset 2, allowing an average curve to be obtained.
Additionally the subsets were inverted with training on the "mediocre" examples
and testing on the "good" examples.

When training on the mediocre samples, perfect accuracy on the good
samples is obtained, whereas training on the good examples and testing on the
mediocre yields substantially worse results.
All the BPH and G4 samples were divided into LCM and unfiltered tissue
subsets to repeat similar experiments as in the previous Section:

Subset 1 = LCM samples (5 BPH + 6 G4)
Subset 2 = unfiltered tissue samples (13 BPH + 18 G4)

There, in spite of the difference in sample size, training on LCM data yields
better results. In spite of the large error bars, this is an indication that the LCM
data preparation method might be of help in improving sample quality.

BPH vs. G4

The Affymetrix data quality criterion was irrelevant for the purpose of
determining the predictive value of particular genes, and while the LCM samples
seemed marginally better than the unfiltered samples, it was not possible to
determine a statistical significance. Therefore, all samples were grouped together
and the separation BPH vs. G4 with all 42 samples (18 BPH and 24 G4) was
performed.



WO 02/059822 



PCT/US02/02243




To evaluate performance and compare Golub's method with SVMs, the
leave-one-out method was used. The fraction of successfully classified left-out
examples gives an estimate of the success rate of the various classifiers.

In this procedure, the gene selection process was run 41 times to obtain
subsets of genes of various sizes for all 41 gene rankings. One classifier was then
trained on the corresponding 40 examples for every subset of genes. This
leave-one-out method differs from the "naive" leave-one-out that consists of running the
gene selection only once on all 41 examples and then training 41 classifiers on
every subset of genes. The naive method gives overly optimistic results because
all the examples are used in the gene selection process, which is like "training on
the test set". The increased accuracy of the first method is illustrated in FIG. 27.
The method used in the figure is SVM-RFE and the classifier used is an SVM.
All SVMs are linear with soft margin parameters C=100 and t=10^-14. The dashed
line represents the "naive" leave-one-out (loo), which consists in running the
gene selection once and performing loo for classifiers using subsets of genes thus
derived, with different sizes. The solid line represents the more computationally
expensive "true" loo, which consists in running the gene selection 41 times, for
every left out example. The left out example is classified with a classifier trained
on the corresponding 40 examples for every selection of genes. If f is the success
rate obtained (a point on the curve), the standard deviation is computed as
sqrt(f(1-f)).
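
The difference between the "true" and the "naive" leave-one-out can be sketched as follows. This is an illustration under stated assumptions: a Golub-style linear classifier stands in for SVM-RFE and the SVM, the data are a toy example, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def train(X, y):
    """Stand-in linear classifier (Golub-style weights), NOT an SVM."""
    pos = [r for r, t in zip(X, y) if t > 0]
    neg = [r for r, t in zip(X, y) if t < 0]
    w, mid = [], []
    for j in range(len(X[0])):
        p, n = [r[j] for r in pos], [r[j] for r in neg]
        spread = (pstdev(p) + pstdev(n)) or 1.0   # guard against zero spread
        w.append((mean(p) - mean(n)) / spread)
        mid.append((mean(p) + mean(n)) / 2)
    return w, -sum(wj * mj for wj, mj in zip(w, mid))

def select(X, y, k):
    """Indices of the k genes with the largest |weight|."""
    w, _ = train(X, y)
    return sorted(range(len(w)), key=lambda j: -abs(w[j]))[:k]

def loo_success(X, y, k, naive=False):
    """Success rate; gene selection is redone per fold unless naive=True."""
    fixed = select(X, y, k) if naive else None    # naive: selection sees all data
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = fixed if naive else select(X_tr, y_tr, k)
        w, b = train([[r[g] for g in genes] for r in X_tr], y_tr)
        d = sum(wj * X[i][g] for wj, g in zip(w, genes)) + b
        hits += (d > 0) == (y[i] > 0)
    return hits / len(X)

def loo_std(f):
    """Standard deviation as given in the text: sqrt(f(1-f))."""
    return math.sqrt(f * (1 - f))

X = [[5.0, 0.1, 1.0], [6.0, 0.3, 0.9], [5.5, 0.2, 1.1],
     [1.0, 0.2, 1.0], [2.0, 0.1, 1.2], [1.5, 0.3, 0.8]]
y = [1, 1, 1, -1, -1, -1]
print(loo_success(X, y, 1), loo_success(X, y, 1, naive=True))
```

On such an easily separable toy set both estimates agree; the optimistic bias of the naive variant only shows when the selected genes depend on the left-out example.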

Comparison between SVMs and Golub's method

The "true" leave-one-out method was used to evaluate both Golub's
method and SVMs. The results are shown in FIG. 28. SVMs outperform
Golub's method for the small number of examples. However, the difference is
not statistically significant in a sample of this size (1 error in 41 examples, only
85% confidence that SVMs are better). FIG. 29 depicts the decision functions
obtained for the two best ranking genes with either method.

The gene selection was then run for both methods on all 41 samples.
Many of the top ranking genes found were related to cancer, as shown in a
bibliographical search.



EXAMPLE 4

Analyzing Small Data sets with Multiple Features

Small data sets with large numbers of features present several problems.
In order to address ways of avoiding data overfitting and to assess the
significance in performance of multivariate and univariate methods, the samples
from Example 3 which were classified by Affymetrix as high quality samples
were further analyzed. The samples include 8 BPH and 9 G4 tissues. Each
microarray recorded 7129 gene expression values. The methods described herein
can use the 2/3 of the samples in the BPH/G4 subset which were considered of
inadequate quality for use with standard methods.

The first method is used to solve a classical Machine Learning problem. If
only a few tissue examples are used to select best separating genes, these genes
are likely to separate well the training examples but perform poorly on new,
unseen examples (test examples). Single-feature SVM, described herein,
performs particularly well under these adverse conditions. The second method is
used to solve a problem of classical statistics and requires a test that uses a
combination of the McNemar criterion and the Wilcoxon test. This test allows
the comparison of the performance of two classifiers trained and tested on random
splits of the data set into a training and a test set.

The method of classifying data has been disclosed elsewhere and is
repeated herein for clarity. The problem of classifying gene expression data can
be formulated as a classical classification problem where the input is a vector, a
"pattern" of n components called "features". F is the n-dimensional feature
space. In the case of the problem at hand, the features are gene expression
coefficients and patterns correspond to tissues. This is limited to two-class
classification problems. The two classes are identified with the symbols (+) and
(-). A training set of a number of patterns {x_1, x_2, ... x_k, ... x_p} with known class
labels {y_1, y_2, ... y_k, ... y_p}, y_k ∈ {-1, +1}, is given. The training set is usually a
subset of the entire data set, some patterns being reserved for testing. The training
patterns are used to build a decision function (or discriminant function) D(x), that
is a scalar function of an input pattern x. New patterns (e.g. from the test set) are
classified according to the sign of the decision function:

D(x) < 0 ⇒ x ∈ class (-)
D(x) > 0 ⇒ x ∈ class (+)
D(x) = 0, decision boundary.

Decision functions that are simple weighted sums of the training patterns plus a
bias are called linear discriminant functions.

D(x) = w.x + b,   (1)

where w is the weight vector and b is a bias value.

A data set such as the one used in these experiments is said to be
"linearly separable" if a linear discriminant function can separate it without error.
The data set under study is linearly separable. Moreover, there exist single
features (gene expression coefficients) that alone separate the entire data set.
This study is limited to the use of linear discriminant functions. A subset of
linear discriminant functions is selected that analyze data from different points
of view:

One approach used multivariate methods, which computed every
component of the weight w on the basis of all input variables (all features), using
the training examples. For multivariate methods, it does not make sense to
intermix features from various rankings as feature subsets are selected for the
complementarity of their features, not for the quality of the individual features.
The combination then consists in selecting the feature ranking that is most consistent
with all other rankings, i.e. contains in its top ranking features the highest density
of features that appear at the top of other feature rankings. Two such methods
were selected:

LDA: Linear Discriminant Analysis, also called Fisher's linear
discriminant (see e.g. (Duda, 73)). Fisher's linear discriminant is a method that
seeks for w the direction of projection of the examples that maximizes the ratio of
the between class variance over the within class variance. It is an "average case"
method since w is chosen to maximally separate the class centroids.

SVM: The optimum margin classifier, also called linear Support Vector
Machine (linear SVM). The optimum margin classifier seeks for w the direction
of projection of the examples that maximizes the distance between patterns of
opposite classes that are closest to one another (margin). Such patterns are called
support vectors. They solely determine the weight vector w. It is an "extreme
case" method as w is determined by the extremes or "borderline" cases, the
support vectors.

A second approach, multiple univariate methods, was also used. Such
methods computed each component w_i of the weight vectors on the basis of the
values that the single variable x_i takes across the training set. The ranking
indicates relevance of individual features. One method was to combine rankings
to derive a ranking from the average weight vectors of the classifiers trained on
different training sets. Another method was to first create the rankings from the
weight vectors of the individual classifiers. For each ranking, a vector is created
whose components are the ranks of the features. Such vectors are then averaged
and a new ranking is derived from this average vector. This last method is also
applicable to the combination of rankings coming from different methods, not
necessarily based on the weights of a classifier. Two univariate methods, the
equivalents of the multivariate methods, were selected:
SF-LDA: Single Feature Linear Discriminant Analysis:

w_i = (μ_i(+) - μ_i(-)) / sqrt(p(+)σ_i(+)² + p(-)σ_i(-)²)   (13)

SF-SVM: Single Feature Support Vector Machine:

w_i = (s_i(+) - s_i(-)), if sign(s_i(+) - s_i(-)) = sign(μ_i(+) - μ_i(-))
w_i = 0 otherwise.   (14)

The parameters μ_i and σ_i are the mean and standard deviation of the gene
expression values of gene i for all the tissues of class (+) or class (-), i=1,...n. p(+)
and p(-) are the number of examples of class (+) or class (-).

The single feature Fisher discriminant (SF-LDA) bears a lot of
resemblance with the method of Golub et al. (Golub, 1999). This last method
computes the weights according to w_i = (μ_i(+) - μ_i(-)) / (σ_i(+) + σ_i(-)). The two
methods yield similar results.

Feature normalization played an important role for the SVM methods. All
features were normalized by subtracting their mean and dividing by their standard
deviation. The mean and standard deviation are computed on training examples
only. The same values are applied to test examples. This is to avoid any use of
the test data in the learning process.

The bias value can be computed in several ways. For LDA methods, it is
computed as: b = -(m(+) + m(-))/2, where m(+) = w.μ(+) and m(-) = w.μ(-). This way,
the decision boundary is in the middle of the projection of the class means on the
direction of w. For SVMs, it is computed as b = -(s(+) + s(-))/2, where s(+) = min
w.x(+) and s(-) = max w.x(-), the minimum and maximum being taken over all
training examples x(+) and x(-) in class (+) and (-) respectively. This way, the
decision boundary is in the middle of the projection of the support vectors of
either class on the direction of w, that is in the middle of the margin.
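
The two ways of computing the bias can be sketched as follows for a given weight vector w. This is a minimal illustration on toy one-dimensional data; the function names are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def split(X, y):
    pos = [x for x, t in zip(X, y) if t > 0]
    neg = [x for x, t in zip(X, y) if t < 0]
    return pos, neg

def lda_bias(w, X, y):
    """b = -(m(+) + m(-)) / 2, with m(.) the projection of the class mean on w."""
    pos, neg = split(X, y)
    m_pos = sum(dot(w, x) for x in pos) / len(pos)   # equals w.mu(+)
    m_neg = sum(dot(w, x) for x in neg) / len(neg)   # equals w.mu(-)
    return -(m_pos + m_neg) / 2

def svm_bias(w, X, y):
    """b = -(s(+) + s(-)) / 2, with s(+) = min w.x(+) and s(-) = max w.x(-)."""
    pos, neg = split(X, y)
    s_pos = min(dot(w, x) for x in pos)
    s_neg = max(dot(w, x) for x in neg)
    return -(s_pos + s_neg) / 2

X = [[4.0], [10.0], [1.0], [-1.0]]
y = [1, 1, -1, -1]
w = [1.0]
print(lda_bias(w, X, y), svm_bias(w, X, y))  # boundary at 3.5 vs. 2.5
```

The example shows the "average case" boundary sitting between the class means, while the "extreme case" boundary sits in the middle of the gap between the closest examples of opposite classes.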



The magnitude of the weight vectors of trained classifiers was used to
rank features (genes). Intuitively, those features with smallest weight contribute
least to the decision function and therefore can be spared.

For univariate methods, such ranking corresponds to ranking features
(genes) individually according to their relevance. Subsets of complementary
genes that together separate best the two classes cannot be found with univariate
methods.

For multivariate methods, each weight $w_i$ is a function of all the features
of the training examples. Therefore, removing one or several such features
affects the optimality of the decision function. The decision function must be
recomputed after feature removal (retraining). Recursive Feature Elimination
(RFE) is the iterative process alternating between two steps: (1) removing
features and (2) retraining, until all features are exhausted. For multiple
univariate methods, retraining does not change the weights and is therefore
omitted. The order of feature removal defines a feature ranking or, more
precisely, nested subsets of features. Indeed, the last feature to be removed with
RFE methods may not be the feature that by itself best separates the data set.
Instead, the last 2 or 3 features to be removed may form the best subset of
features that together separate best the two classes. Such a subset is usually better
than a subset of 3 features that individually rank high with a univariate method.
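
The RFE loop described above may be sketched as follows. This is an illustration under stated assumptions, not the trained classifiers of the experiments: the trainer `train_linear` is a stand-in (ridge-regularized least squares fitted by gradient descent, with hypothetical hyperparameters), the squared weight is used as the ranking criterion, one feature is removed per iteration, and the data are a toy example.

```python
def train_linear(X, y, lam=0.1, lr=0.1, steps=500):
    """Stand-in multivariate trainer (assumption): ridge least squares by gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        resid = [sum(wj * xj for wj, xj in zip(w, x)) - t for x, t in zip(X, y)]
        grad = [sum(r * x[j] for r, x in zip(resid, X)) / n + lam * w[j]
                for j in range(d)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

def rfe(X, y, train=train_linear):
    """Return feature indices in order of elimination (least useful first)."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        Xs = [[x[j] for j in remaining] for x in X]
        w = train(Xs, y)                                  # retrain on surviving features
        worst = min(range(len(w)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))           # remove smallest-weight feature
    return eliminated

# Toy data: feature 0 separates the classes; features 1 and 2 carry little information.
X = [[2.0, 0.5, 0.1],
     [1.5, -0.5, -0.1],
     [-2.0, 0.5, -0.1],
     [-1.5, -0.5, 0.1]]
y = [1, 1, -1, -1]

order = rfe(X, y)              # elimination order
ranking = order[::-1]          # best feature first
```

Reading `order` backwards gives the nested subsets: the last k features removed form the subset of size k. Any trainer returning one weight per feature could be passed as `train`.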

Statistical significance of performance differences

For very small data sets, it is particularly important to assess the
statistical significance of the results. Assume that the data set is split into 8
examples for training and 9 for testing. The conditions of this experiment often
result in a 1 or 0 error on the test set. A z-test with a standard definition of
"statistical significance" (95% confidence) is used. For a test set of size t=9 and
a true error rate p=1/9, the difference between the observed error rate and the true
error rate can be as large as 17%. The formula
$\epsilon = z_\eta \sqrt{p(1-p)/t}$, where
$z_\eta = \sqrt{2}\,\mathrm{erfinv}(1 - 2\eta)$, $\eta = 0.05$, was used, where
erfinv is the inverse error function, which is tabulated.

The error function is defined as
$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,dt$. This estimate
assumes i.i.d. errors (where the data used in training and testing were
independently and identically distributed), one-sided risk and the approximation
of the Binomial law by the Normal law. This is to say that the absolute
performance results (question 1) should be considered with extreme care because
of the large error bars.
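
As a worked check of the 17% figure, the error bar may be computed as sketched below. The sketch assumes the one-sided normal approximation described above and uses the identity that the 1 - eta quantile of the standard normal distribution equals sqrt(2) erfinv(1 - 2 eta); the function name `error_bar` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def error_bar(p, t, eta=0.05):
    """epsilon = z_eta * sqrt(p (1 - p) / t) for true error rate p and test set size t."""
    # The (1 - eta) standard normal quantile equals sqrt(2) * erfinv(1 - 2 * eta).
    z_eta = NormalDist().inv_cdf(1.0 - eta)
    return z_eta * sqrt(p * (1.0 - p) / t)

# Test set of size t = 9 and true error rate p = 1/9, as in the text.
eps = error_bar(p=1 / 9, t=9)
```

With these values `eps` is approximately 0.17, that is, the observed error rate may differ from the true error rate by about 17%.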

In contrast, it is possible to compare the performance of two classification
systems (relative performance, question 2) and, in some cases, assert with
confidence that one is better than the other. One of the most accurate tests is the
McNemar test, which proved to be particularly well suited to comparing
classification systems in a recent benchmark. The McNemar test assesses the
significance of the difference between two dependent samples when the variable
of interest is a dichotomy. With confidence $(1-\alpha)$ it can be accepted that one
classifier is better than the other, using the formula:

$(1-\alpha) = 0.5 + 0.5\,\mathrm{erf}(z_\alpha / \sqrt{2})$    (15)

where $z_\alpha = \epsilon t / \sqrt{v}$; t is the number of test examples, v is the
total number of errors (or rejections) that only one of the two classifiers makes,
$\epsilon$ is the difference in error rate, and erf is the error function
$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,dt$.
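
The McNemar comparison may be illustrated by the following sketch. It assumes the one-sided normal approximation, in which the confidence is 0.5 + 0.5 erf(z / sqrt(2)) with z = epsilon t / sqrt(v); the numerical values are hypothetical and the case v = 0 (the two classifiers never disagree) is not handled.

```python
from math import erf, sqrt

def mcnemar_confidence(eps, t, v):
    """Confidence that one classifier is better than the other.

    eps: difference in error rate between the two classifiers
    t:   number of test examples
    v:   number of errors made by only one of the two classifiers (must be > 0)
    """
    z = eps * t / sqrt(v)
    return 0.5 + 0.5 * erf(z / sqrt(2.0))

# Hypothetical case: 100 test examples; one classifier alone errs on 12 of them,
# the other alone on 4, so v = 16 and eps = (12 - 4) / 100 = 0.08.
conf = mcnemar_confidence(eps=0.08, t=100, v=16)
```

Here z equals 2, giving a confidence of roughly 97.7% that the classifier with fewer errors is the better one.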

This assumes i.i.d. errors, one-sided risk and the approximation of the
Binomial law by the Normal law. The comparison of two classification systems
and the comparison of two classification algorithms need to be distinguished. The
first problem addresses the comparison of the performance of two systems on test
data, regardless of how these systems were obtained (they might not have been
obtained by training). This problem arises, for instance, in the quality
comparison of two classification systems packaged in medical diagnosis tests
ready to be sold.

A second problem addresses the comparison of the performance of two
algorithms on a given task. It is customary to average the results of several
random splits of the data into a training set and a test set of a given size. The
proportions of training and test data are varied and results plotted as a function of
the training set size. Results are averaged over s=20 different splits for each
proportion (only 17 in the case of a training set of size 16, since there are only 17
examples). To compare two algorithms, the same data sets to train and test are
used with the two algorithms, therefore obtaining paired experiments. The
Wilcoxon signed rank test is then used to evaluate the significance of the
difference in performance. The Wilcoxon test tests the null hypothesis that two
treatments applied to N individuals do not differ significantly. It assumes that the
differences between the treatment results are meaningful. The Wilcoxon test is
applied as follows: for each paired test i, i=1,...,s, the difference $\epsilon_i$ in
error rate of the two classifiers trained with the two algorithms to be compared
is computed. The test first orders the absolute values of $\epsilon_i$ from the least
to the greatest. The quantity T to be tested is the sum of the ranks of the absolute
values of $\epsilon_i$ over all positive $\epsilon_i$. The distribution of T can
easily be calculated exactly or be approximated by the Normal law for large values
of s. The test could also be applied by replacing $\epsilon_i$ by the normalized
quantity $\epsilon_i / \sqrt{v_i}$ used in (15) for the McNemar test, computed for
each paired experiment. In this study, the difference in error rate $\epsilon_i$ is
used. The p value of the test is used in the present experiments: the probability
of observing more extreme values than T by chance if H0 is true,
Proba(TestStatistic > Observed T).

If the p value is small, this sheds doubt on H0, which states that the
medians of the paired experiments are equal. The alternative hypothesis is that
one is larger than the other.
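
The Wilcoxon signed rank computation may be sketched as follows. The sketch makes several assumptions not stated in the text: zero differences are dropped, tied absolute values receive average ranks, the exact null distribution is obtained by dynamic programming, and the one-sided p value counts outcomes at least as large as the observed T. The paired differences are hypothetical.

```python
def wilcoxon_signed_rank(diffs):
    """Return (T, p) where T sums the ranks of |d| over positive d.

    p is the exact one-sided probability of a rank sum >= T under H0,
    where each difference is positive or negative with probability 1/2.
    """
    d = [x for x in diffs if x != 0]            # assumption: zero differences dropped
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks2 = [0] * len(d)                        # doubled ranks keep tied averages integral
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # twice the average of ranks i+1..j+1
        i = j + 1
    t2 = sum(r for r, x in zip(ranks2, d) if x > 0)

    # Exact null distribution of the doubled rank sum.
    counts = {0: 1}
    for r in ranks2:
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c            # this difference is negative
            nxt[s + r] = nxt.get(s + r, 0) + c    # this difference is positive
        counts = nxt
    p = sum(c for s, c in counts.items() if s >= t2) / 2 ** len(d)
    return t2 / 2, p

# Hypothetical error-rate differences from 5 paired experiments, all favoring one algorithm.
T, p_value = wilcoxon_signed_rank([0.10, 0.05, 0.20, 0.15, 0.25])
```

With all five differences positive, T takes its maximum value of 15 and the exact one-sided p value is 1/32.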

Preprocessing

The normalized arrays as provided by Affymetrix were used. No other
preprocessing is performed on the overall data set. However, when the data was
split into a training set and a test set, the mean of each gene is subtracted over all
training examples and divided by its standard deviation. The same mean and
standard deviation are used to shift and scale the test examples. No other
preprocessing or data cleaning was performed.

It can be argued that genes that are poorly contrasted have a very low
signal/noise ratio. Therefore, the preprocessing that divides by the standard
deviation just amplifies the noise. Arbitrary patterns of activities across tissues
can be obtained for a given gene. This is indeed of concern for unsupervised
learning techniques. For supervised learning techniques, however, it is unlikely
that a noisy gene would by chance separate perfectly the training data, and it will
therefore be discarded automatically by the feature selection algorithm.
Specifically, for an over-expressed gene, gene expression coefficients took
positive values for G4 and negative values for BPH. Values are drawn at random
with a probability 1/2 to draw a positive or negative value for each of the 17
tissues. The probability of drawing exactly the right signs for all the tissues is
$(1/2)^{17}$. The same value exists for an under-expressed gene (opposite signs). Thus
the probability for a purely noisy gene to separate perfectly all the BPH from the
G4 tissues is $p = 2(1/2)^{17} \approx 1.5 \times 10^{-5}$. There are
m = 7129 - 5150 = 1979 presumably noisy genes. If they were all just pure noise,
there would be a probability $(1-p)^m$ that none of them separate perfectly all the
BPH from the G4 tissues. Therefore there is a probability
$1 - (1-p)^m \approx 3\%$ that at least one of them does separate perfectly all the
BPH from the G4 tissues.
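
The arithmetic above may be checked with the short sketch below. It assumes that the count of presumably noisy genes is m = 7129 - 5150 = 1979, which is the reading of the source consistent with the stated result of about 3%.

```python
n_tissues = 17

# Probability that a pure-noise gene has exactly the right sign on every tissue,
# counted for both the over-expressed and the under-expressed pattern.
p = 2 * 0.5 ** n_tissues

# Presumably noisy genes (assumed reading of the source: 7129 - 5150).
m = 7129 - 5150

# Probability that at least one noisy gene separates BPH from G4 perfectly.
p_at_least_one = 1 - (1 - p) ** m
```

The result is close to 0.03, in agreement with the 3% quoted in the text.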

For single feature algorithms, none of a few discarded genes made it to
the top, so the risk is irrelevant. For SVM and LDA, there is a higher risk of
using a "bad" gene since gene complementarity is used to obtain good
separations, not single genes. However, in the best gene list, no gene from the
discarded list made it to the top.

Data splits 

Simulations resulting from multiple splits of the data set of 17 examples
(8 BPH and 9 G4) into a training set and a test set were run. The size of the
training set is varied. For each training set drawn, the remainder of the data are
used for testing.

For a number of training examples greater than 4 and less than 16, 20
training sets were selected at random. For 16 training examples, the leave-one-out
method was used, in that all the possible training sets obtained by removing 1
example at a time (17 possible choices) were created. The test set is then of size
1. Note that the test set is never used as part of the feature selection process, even
in the case of the leave-one-out method.

For 4 examples, all possible training sets containing 2 examples of each
class (2 BPH and 2 G4) were created and 20 of them were selected at random.

For SVM methods, the initial training set size is 2 examples, one of each
class (1 BPH and 1 G4). The examples of each class are drawn at random. The
performance of the LDA methods cannot be computed with only 2 examples,
because at least 4 examples (2 of each class) are required to compute intraclass
standard deviations. The number of training examples is incremented by steps
of 2.
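
The splitting schemes described above may be sketched as follows. The function names, the label encoding (-1 and +1) and the random seed are hypothetical choices made for illustration only.

```python
import random
from itertools import combinations

def leave_one_out(n):
    """All n splits obtained by holding out one example at a time."""
    return [([i for i in range(n) if i != k], [k]) for k in range(n)]

def random_splits(n, n_train, n_splits=20, seed=0):
    """Random training sets; the remainder of the data forms the test set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        train = sorted(rng.sample(range(n), n_train))
        test = [i for i in range(n) if i not in train]
        splits.append((train, test))
    return splits

def balanced_splits(labels, per_class=2, n_splits=20, seed=0):
    """Enumerate all training sets with per_class examples of each class, then sample."""
    pos = [i for i, l in enumerate(labels) if l == +1]
    neg = [i for i, l in enumerate(labels) if l == -1]
    all_sets = [sorted(a + b)
                for a in combinations(pos, per_class)
                for b in combinations(neg, per_class)]
    rng = random.Random(seed)
    chosen = rng.sample(all_sets, n_splits)
    return [(tr, [i for i in range(len(labels)) if i not in tr]) for tr in chosen]

# Hypothetical labels for the 17 tissues: 8 of one class and 9 of the other.
labels = [-1] * 8 + [+1] * 9

loo = leave_one_out(len(labels))             # 16 training examples, test set of size 1
mid = random_splits(len(labels), n_train=8)  # 20 random training sets of size 8
small = balanced_splits(labels)              # 20 training sets with 2 examples per class
```

In every scheme the test indices are derived from the training indices, so feature selection performed on a training set never sees the corresponding test examples.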



Learning curves

The curves presented in this section are obtained with the following
procedure (Table 26):

Table 26: Experimental procedure.

1) For each feature selection/classification method
   (SVM, SF-SVM, LDA, SF-LDA):
2) For each number of training examples ((2), 4, 6, 8, 10, 12, 14, 16):
3) For each particular drawing of the training/test split
   (20 drawings, in general):
4) Compute the feature subset ranking (using the training examples only).
5) For each subset of genes of size 0, 1, 2, ... on a log2 scale:
6) Compute the weights of the classifier using the training examples.
7) Compute the performance on the test examples.

Overall, SF-SVM performs best, with the following four quadrants
distinguished (Table 27):

Num. Genes | Num. Ex. small | Num. Ex. large
-----------|----------------|---------------
Large | SF-SVM performs best. Single feature methods (SF-SVM and SF-LDA) outperform multivariate methods (SVM and LDA). | [illegible] best. The differences are not statistically significant.
Small | SF-LDA performs best. LDA performs worst and single feature methods outperform multivariate methods. | LDA performs [illegible]. It is unclear whether single feature methods perform better. SF-SVM may have an advantage.

Table 27: Best performing methods of feature selection/classification.

The choice of $w_i = 0$ for negative margin genes in SF-SVM
(Equation 3) corresponds to an implicit pre-selection of genes and partially
explains why SF-SVM performs so well for large numbers of genes. In fact, no
genes are added beyond the total number of genes that separate perfectly G4 from
BPH.

Final run on all the examples

All methods were re-run using the entire data set. The top ranked genes
are presented in Tables 28-31. Having determined that the SVM method
provided the most compact set of features to achieve 0 leave-one-out error and
that the SF-SVM method is the best and most robust method for small numbers
of training examples, the top genes found by these methods were researched in
the literature. Most of the genes have a connection to cancer or more specifically
to prostate cancer.

Table 28: Top ranked genes for SF-LDA using 17 best BPH/G4. GAN = Gene
Accession Number. EXP = Expression (-1 = underexpressed in cancer (G4) tissues,
+1 = overexpressed in cancer tissues).

Rank | GAN | EXP | Description
-----|-----|-----|------------
10 | [illegible] | [illegible] | [illegible]
9 | [illegible] | [illegible] | [illegible]
8 | [illegible] | [illegible] | [illegible]
7 | L24203 | [illegible] | [illegible]
6 | [illegible] | -1 | Homo sapiens 50 kDa type I epidermal keratin
5 | [illegible] | -1 | Human mRNA for smooth muscle myosin heavy [...]
4 | [illegible] | -1 | Human transforming growth factor beta 3 (TGF [...]
3 | [illegible] | -1 | [illegible]
2 | [illegible] | [illegible] | [illegible]
1 | [illegible] | -1 | H. sapiens PrP gene

Table 29: Top ranked genes for LDA using 17 best BPH/G4.

Table 30: Top ranked genes for SF-SVM using 17 best BPH/G4.

Table 31: Top ranked genes for SVM using 17 best BPH/G4.

Results on 42 tissues

Using the "true" leave-one-out method (including gene selection and
classification), the experiments indicated that 2 genes should suffice to achieve
100% prediction accuracy. The two top genes were therefore more particularly
researched in the literature. The results are summarized in Table 33. It is
interesting to notice that the two genes selected appear frequently in the top 10
lists of Tables 28-31 obtained by training only on the 17 best genes.

Table 32: Top ranked genes for SVM using all 42 BPH/G4.

GAN: M16938
Synonyms: HOXC8
Possible function/link to prostate cancer: Hox genes encode transcriptional
regulatory proteins that are largely responsible for establishing the body plan
of all metazoan organisms. There are hundreds of papers in PubMed reporting the
role of HOX genes in various cancers. HOXC5 and HOXC8 expression are selectively
turned on in human cervical cancer cells compared to normal keratinocytes.
Another homeobox gene (GBX2) may participate in metastatic progression in
prostatic cancer. Another HOX protein (hoxb-13) was identified as an
androgen-independent gene expressed in adult mouse prostate epithelial cells.
[illegible] a new potential target for developing therapeutics to treat
[illegible] prostate cancer.

GAN: U35735
Synonyms: Jk, Kidd, RACH1, RACH2, SLC14A1, UT1, UTE
Possible function/link to prostate cancer: Overexpression of RACH2 in human
tissue culture cells induces apoptosis. RACH1 is downregulated in breast cancer
cell line MCF-7. RACH2 complements the RAD1 protein. RAD1 is implicated in
several cancers. Significant positive lod scores of 3.39 for linkage of the Jk
(Kidd blood group) with cancer family syndrome (CFS) were obtained. CFS gene(s)
may possibly be located on chromosome 2, where Jk is located.

Table 33: Findings for the top 2 genes found by SVM using all 42 BPH/G4.

Taken together, the expression of these two genes is indicative of the severity of
the disease (Table 34).

Table 34: Severity of the disease as indicated by the top 2 ranking genes
selected by SVMs using all 42 BPH and G4 tissues.

T-test

One of the reasons for choosing SF-LDA as a reference method to
compare SVMs with is that SF-LDA bears a lot of resemblance with one of the
gene ranking techniques used by Affymetrix. Indeed, Affymetrix uses the p value
of the T-test to rank genes. While not wishing to be bound by any particular
theory, it is believed that the null hypothesis to be tested is the equality of the two
expected values of the expressions of a given gene for class (+) BPH and class (-)
G4. The alternative hypothesis is that the one with largest average value has the
largest expected value. The p value is a monotonically varying function of the
quantity to be tested:

$T_i = (\mu_i(+) - \mu_i(-)) / (\sigma_i \sqrt{1/p(+) + 1/p(-)})$

where $\mu_i(+)$ and $\mu_i(-)$ are the means of the gene expression values of
gene i for all the tissues of class (+) or class (-), i=1,...,n; p(+) and p(-) are the
numbers of examples of class (+) or class (-); and
$\sigma_i^2 = (p(+)\sigma_i(+)^2 + p(-)\sigma_i(-)^2)/p$ is the intra-class
variance. Up to a constant factor, which does not affect the ranking, $T_i$ is the
same criterion as $w_i$ in Equation (2) used for ranking features by SF-LDA.
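
The quantity tested may be computed as in the sketch below. It assumes that p in the intra-class variance denotes the total number of examples, p = p(+) + p(-), and that per-class variances are population variances; neither is stated explicitly in the text. The expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance

def t_statistic(pos, neg):
    """T_i = (mu(+) - mu(-)) / (sigma * sqrt(1/p(+) + 1/p(-))) for one gene.

    pos / neg: expression values of the gene in class (+) / class (-).
    Assumption: sigma^2 = (p(+) var(+) + p(-) var(-)) / (p(+) + p(-)).
    """
    p_pos, p_neg = len(pos), len(neg)
    var = (p_pos * pvariance(pos) + p_neg * pvariance(neg)) / (p_pos + p_neg)
    return (mean(pos) - mean(neg)) / (sqrt(var) * sqrt(1 / p_pos + 1 / p_neg))

# Hypothetical expression values of one gene in each class.
T = t_statistic([2.0, 4.0], [0.0, 2.0])
```

Since genes are ranked by this value, any constant factor introduced by a different variance convention leaves the ranking unchanged when the class sizes are the same for every gene.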
It was pointed out by Affymetrix that the p value may be used as a measure of risk
of drawing the wrong conclusion that a gene is relevant to prostate cancer, based
on examining the differences in the means. Assume all the genes with p value
lower than a threshold $\alpha$ are selected. At most a fraction $\alpha$ of those
genes should be bad choices. However, this interpretation is not quite accurate
since the gene expression values of different genes on the same chip are not
independent experiments. Additionally, this assumes the equality of the variances
of the two classes, which should be tested.

There are variants in the definition of $T_i$ that may account for small
differences in gene ranking. Another variant of the method is to restrict the list of
genes to genes that are overexpressed in all G4 tissues and underexpressed in all
BPH tissues (or vice versa). For the purpose of comparison, a variant of SF-LDA
was also applied in which only genes that perfectly separate BPH from G4 in the
training data were used. This variant performed similarly to SF-LDA for small
numbers of genes (as it is expected that a large fraction of the genes ranked high
by SF-LDA also separate perfectly the training set). For large numbers of genes it
performed similarly to SF-SVM (all genes that do not separate perfectly the
training set get a weight of zero, all the others are selected, as for SF-SVM). But
it did not perform better than SF-SVM, so it was not retained.
Clustering

Another technique that Affymetrix uses is clustering, and more
specifically Self Organizing Maps (SOM). Clustering can be used to group genes
into clusters and define "super-genes" (cluster centers). The super-genes that are
over-expressed for G4 and underexpressed for BPH examples (or vice versa) are
identified (visually). Their cluster members are selected. The intersection of these
selected genes and genes selected with the T-test is taken to obtain the final gene
subset.

Clustering is a means of regularization that reduces the dimensionality of
feature space prior to feature selection. Feature selection is performed on a
smaller number of "super-genes". Combining clustering and feature selection in
this way on this particular data set will be the object of future work.
V. Conclusions

Meaningful feature selection can be performed with as few as 17 examples and
7129 features. On this data set, single feature SVM performs the best.

EXAMPLES

Application of SVM RFE to lymphoma

SVM RFE outperforms Golub's method significantly in a wide range of
values of training dataset sizes and the number of selected input variables on
certain data sets. This data set includes 96 tissue samples (72 lymphoma and 24
non-cancer) for which 4026 gene expression coefficients were recorded. A
simple preprocessing was performed and missing values were replaced by zeros.
The dataset was split into training and test sets in various proportions and each
experiment was repeated on 96 different splits. Variable selection was performed
with the RFE algorithm by removing genes with smallest weights and retraining
repeatedly.

The gene set size was decreased logarithmically, apart from the last 64
genes, which were removed one at a time according to the RFE method. FIG. 30
demonstrates the learning curves when the number of genes varies in the gene
elimination process. The success rate is represented as a function of the training
set size and the number of genes retained in the gene elimination process. For
comparison, FIG. 31 depicts the results obtained by a competing technique by
Golub that uses a correlation coefficient to rank order genes:

$w_i = (\mu_i(+) - \mu_i(-)) / (\sigma_i(+) + \sigma_i(-))$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the gene
expression values of a particular gene i for all the patients of class (+) or
class (-), i = 1,...,n. Large positive $w_i$ values indicate strong correlation with
class (+) whereas large negative $w_i$ values indicate strong correlation with
class (-). One method is to select an equal number of genes with positive and with
negative correlation coefficient.

Classification is performed using the selected genes, each contributing
to the final decision by voting according to the magnitude of their correlation
coefficient. Other comparisons with a number of other methods including
Fisher's discriminant, decision trees, and nearest neighbors have confirmed the
superiority of SVMs.

It should be understood, of course, that the foregoing relates only to
preferred embodiments of the present invention and that numerous modifications
or alterations may be made therein without departing from the spirit and the scope
of the invention as set forth in the appended claims. Such alternate embodiments
are considered to be encompassed within the spirit and scope of the present
invention. Accordingly, the scope of the present invention is described by the
appended claims and is supported by the foregoing description.

Claims

What is claimed is:

1. A computer-implemented method for identifying patterns in data,
the method comprising:

(a) inputting into a classifier a training set having known outcomes, the
classifier comprising a decision function having a plurality of weights, each
having a weight value, wherein the training set comprises features corresponding
to the data and wherein each feature has a corresponding weight;

(b) optimizing the plurality of weights so that classifier error is
minimized;

(c) computing ranking criteria using the optimized plurality of weights;

(d) eliminating at least one feature corresponding to the smallest ranking
criterion;

(e) repeating steps (a) through (d) for a plurality of iterations until a
subset of features of predetermined size remains; and

(f) inputting into the classifier a live set of data wherein the features
within the live set are selected according to the subset of features.

2. The method of claim 1, wherein the classifier is a support vector
machine.

3. The method of claim 1, wherein the classifier is a soft margin
support vector machine.

4. The method of claim 1, wherein the ranking criterion
corresponding to a feature is calculated by squaring the optimized weight for the
corresponding feature.

5. The method of claim 1, wherein the decision function is a
quadratic function.

6. The method of claim 1, wherein step (d) comprises eliminating a
plurality of features corresponding to the smallest ranking criteria in a single
iteration of steps (a) through (d).

7. The method of claim 1, wherein step (d) comprises eliminating a
plurality of features corresponding to the smallest ranking criteria in at least the
first iteration of steps (a) through (d) and in later iterations, eliminating one
feature for each iteration.

8. The method of claim 1, wherein step (d) comprises eliminating a
plurality of features corresponding to the smallest ranking criteria so that the
number of features is reduced by a factor of two for each iteration.

9. The method of claim 1, wherein the training set and the live set
each comprise gene expression data obtained from DNA micro-arrays.

10. The method of claim 1, further comprising pre-processing the
training set and the live set so that the features are comparably scaled.